Here’s the link to the repo on github: https://github.com/oj-lappi/IODS-project
I’d like to learn about statistical tests and possibly how to add R to data pipelines.
I don’t think R will replace my current tools for analyzing and plotting data, but I do want to at least see how feasible it would be to add R as a tool for developing statistical analyses for experiments, and possibly writing some small programs for quick statistical analysis in postprocessing of experiments.
I’m familiar with git (it’s a daily driver for me). I tend to use gnuplot and python for quick plotting, and either python, UNIX tools, or custom C++ programs for data analysis utilities.
I’m interested in creating automatic data pipelines that process data in object storage and upload plots to a dashboard. R+Github pages might be a good way to do that, although I would prefer an on-prem solution (HU’s gitlab maybe?) or a self-hosted website.
I have ssh keys set up on each of my computers, and a personal access token would be less flexible than that.
I can push and pull with this setup from RStudio.
I checked the exercises and followed some of them along locally, I think I’m getting the hang of it :).
# Timestamp:
date()
## [1] "Thu Nov 30 16:15:51 2023"
# Dependencies
# install.packages(c("readr", "lmtest", "dplyr"))
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Timestamp
date()
## [1] "Thu Nov 30 16:15:51 2023"
lrn2014 <- read_csv('data/learning2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "dimensions: 166" "dimensions: 7"
There are 166 rows with 7 columns each.
The spec is:
## cols(
## gender = col_character(),
## age = col_double(),
## attitude = col_double(),
## deep = col_double(),
## stra = col_double(),
## surf = col_double(),
## points = col_double()
## )
gender seems to be a categorical value, so let’s see the
number of rows per gender (sex):
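The count table below can be reproduced with a dplyr one-liner (a sketch; the original code chunk is hidden, and may have used group_by + summarize instead):

```r
# count() is shorthand for group_by(gender) %>% summarize(n());
# name = "count" renames the default "n" column
lrn2014 %>% count(gender, name = "count")
```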
## # A tibble: 2 × 2
## gender count
## <chr> <int>
## 1 F 110
## 2 M 56
110 F and 56 M: there is a skew towards female students in the dataset. Let’s plot that.
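The bar chart in fig. 1.1 was presumably produced along these lines (a sketch; the original chunk is not shown — loading GGally pulls in ggplot2, which explains the messages below):

```r
library(GGally)  # loads ggplot2 as a dependency

# One bar per gender, colored by gender
lrn2014 %>%
  ggplot(aes(x = gender)) +
  geom_bar(aes(fill = gender))
```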
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
fig. 1.1, sex distributions
Let’s plot the age distribution of the students.
lrn2014 %>%
ggplot(aes(x = age)) +
stat_count()
fig. 1.2, age distributions
min(lrn2014$age)
## [1] 17
max(lrn2014$age)
## [1] 55
median(lrn2014$age)
## [1] 22
The age range is from 17 to 55, and the median is 22. Visually inspecting the distribution, the mode of the distribution is early twenties, as you would expect, although there is a long tail.
Let’s combine the two columns into a classic population pyramid, or age-sex pyramid.
But that’s not exactly what we want. It turns out a population pyramid is not an out-of-the-box plot we can easily produce, we have to manually calculate bins and do some magic.
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
group_by(gender, age_bin) %>% # Group by bin and gender
summarize(count = n()) %>% # Count rows per group
mutate(count =
if_else(gender == 'M', -count, count)) %>% # Turn one negative
ggplot(aes(x=count, y = age_bin)) +
geom_col(aes(fill=gender))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3a, population pyramid
There are very few male students under 20, I would speculate that this is due to Finnish army conscription, otherwise the distribution seems roughly equal on the female and male sides.
We can of course bin by year instead of 5 years, and we get a higher resolution but more noise.
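The per-year version only changes the bin breaks; a sketch reconstructing the hidden chunk, based on the 5-year version above:

```r
lrn2014 %>%
  mutate(age_bin = cut(age, breaks = seq(0, 60, 1))) %>% # 1-year bins
  group_by(gender, age_bin) %>%
  summarize(count = n()) %>%
  mutate(count = if_else(gender == 'M', -count, count)) %>% # Mirror one side
  ggplot(aes(x = count, y = age_bin)) +
  geom_col(aes(fill = gender))
```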
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3b, population pyramid #2, fine-grained bins
There is one peculiar decline in female student participation around ages 26-28, which jumps back after thirty. This might be a maternity effect, but that is highly speculative; there are very few samples in this dataset.
Let’s look at exam scores:
paste("median:", median(lrn2014$points), ", mean:",mean(lrn2014$points), ", standard deviation:", sd(lrn2014$points))
## [1] "median: 23 , mean: 22.7168674698795 , standard deviation: 5.89488364245529"
Let’s look at the full distribution using geom_density.
lrn2014 %>%
ggplot(aes(x=points)) +
geom_density()
fig. 1.4a, exam score distribution
There’s a curious valley in the density at around 12-14 points. Let’s look closer.
lrn2014 %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=1) +
scale_x_continuous(breaks = seq(0, 40, by = 1))
fig. 1.4b, exam score histogram
So no students got 12, 13, or 14 points. The jump from 11 to 15 points suggests some kind of grading threshold. Let’s look at our two demographic variables.
lrn2014 %>%
group_by(gender) %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=2, aes(fill=factor(age))) +
facet_wrap(~gender)
fig. 1.5a, exam scores by sex, with age gradient
I see no clear bias either way, perhaps a slight correlation between score and age within the mode (20-30 years) and then no correlation for higher ages. The female distribution seems most well-behaved, although the gap from 11 points up is much sharper here as well.
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
group_by(gender, age_bin) %>%
ggplot(aes(x=points)) +
geom_histogram(boundary=0,binwidth=2, aes(fill=gender)) +
facet_wrap(~age_bin)
fig. 1.5b, exam scores by age group, with labeled sex
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
ggplot(aes(x=age_bin, y = points)) +
stat_boxplot(geom = "errorbar", width = 0.25) +
geom_boxplot()
fig. 1.6a, exam score distributions by age group, box plot sequence
lrn2014 %>%
mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
ggplot(aes(x=age_bin, y = points)) +
stat_boxplot(geom = "errorbar", width = 0.25) +
geom_boxplot()+
facet_wrap(~gender)
fig. 1.6b, exam score distributions by age-sex group, box plot sequence
I see no correlation between age and exam score in these plots.
Finally, let us look at all the survey question scores together with the already explored variables in a correlation matrix.
lrn2014 %>%
select(gender, age, surf, stra, deep, attitude, points) %>%
ggpairs(aes(fill=gender, color=gender),lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 1.7, correlation matrix of all variables
There seem to be negative correlations between:

- surf and deep (but seemingly strong only for male students)
- attitude and deep (weak, but again stronger for male students)
- stra and surf (weak)

And a positive correlation between points and attitude! This is the strongest linear relationship in the data. And we can verify that age does not seem to have an effect at all.

There also seems to be a relationship between attitude and gender, but no relationship between gender and points.

Based on the data exploration, attitude seems the most likely candidate, and next after that: stra and surf.
Let’s start with a simple regression model of
points ~ attitude, which will be our baseline.
attitude_model <- lrn2014 %>%
lm(points ~ attitude, data = .)
summary(attitude_model)
##
## Call:
## lm(formula = points ~ attitude, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is clearly a statistically significant relationship between
attitude and points. But R-squared is only
around 18.5%, so there is a lot of variance not explained by the
model.
three_var_model <- lrn2014 %>%
lm(points ~ attitude + stra + surf, data = .)
summary(three_var_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The adjusted R-squared is higher, which means our model is capturing more of the underlying variation than before, although still below 20%.
It seems that the relationship between points and
surf is not statistically significant.
Let’s drop surf and keep stra, and try
again.
two_var_model <- lrn2014 %>%
lm(points ~ attitude + stra, data = .)
summary(two_var_model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Right, this is our best fit yet, based on the adjusted R-squared.
Depending on our choice of significance level in a hypothesis test, the stra term would be ignored. At a standard level of a = 0.05, we wouldn’t reject the null hypothesis that stra has no linear effect on points.
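An alternative way to judge whether stra earns its place is a nested-model F-test; this was not part of the original analysis, but a sketch with anova() would be:

```r
# Compare the one-predictor baseline against the two-predictor model;
# the F-test asks whether adding stra significantly reduces
# the residual sum of squares
anova(attitude_model, two_var_model)
```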
Let’s test another model, with nothing but stra.
stra_model <- lrn2014 %>%
lm(points ~ stra, data = .)
summary(stra_model)
##
## Call:
## lm(formula = points ~ stra, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5581 -3.8198 0.1042 4.3024 10.1394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.233 1.897 10.141 <2e-16 ***
## stra 1.116 0.590 1.892 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.01538
## F-statistic: 3.578 on 1 and 164 DF, p-value: 0.06031
We get very close to a statistically significant result at a standard
significance level of 0.05, but not quite. Let’s drop
stra, since that’s what the assignment asked us to do.
The model we will be using is therefore the baseline model with one
predictor: attitude.
Let’s rerun that summary:
summary(attitude_model)
##
## Call:
## lm(formula = points ~ attitude, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The fitted regression coefficients are:

- intercept: 11.6372
- attitude: 3.5255

This means that the model predicts the conditional mean of the exam scores, given an attitude value, as:

points = 11.6372 + 3.5255 * attitude

If we multiply the attitude coefficient by the range of attitude in the population, we can get an idea of how the model assigns expected exam scores based on a student’s attitude.
am <- mean(lrn2014$attitude)
as <- sd(lrn2014$attitude)
print("Range of predictor term within a sd:")
## [1] "Range of predictor term within a sd:"
3.5255*c(am-as, am, am+as)
## [1] 8.506549 11.079839 13.653130
print("Range of ŷ_i within a sd:")
## [1] "Range of ŷ_i within a sd:"
11.6372 + 3.5255*c(am-as, am, am+as)
## [1] 20.14375 22.71704 25.29033
print("Range of ŷ_i within two sd's:")
## [1] "Range of ŷ_i within two sd's:"
11.6372 + 3.5255*c(am-2*as, am, am+2*as)
## [1] 17.57046 22.71704 27.86362
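The same fitted values can be obtained without copying coefficients by hand, using predict() on the model object (a sketch; as_ is used here only to avoid shadowing base::as):

```r
am  <- mean(lrn2014$attitude)
as_ <- sd(lrn2014$attitude)

# Fitted conditional means at mean attitude and one sd either side
predict(attitude_model,
        newdata = data.frame(attitude = c(am - as_, am, am + as_)))
```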
So, assuming attitude is Gaussian and that the sample stddev is a good estimate, the model assigns an exam score:

- between roughly 20 and 25 points to a majority of students (about 68%; one stddev on either side of the mean of a Gaussian captures 68.3% of the population, 34.1% per side)
- between roughly 17.6 and 27.9 points to a super-majority (about 95%) of students
If we look back at the exam score distribution in figure 1.4, this does capture the mode of the distribution.
The multiple R-squared value is 0.1906, the standard way to express
this is that “19% of the variation in exam scores is explained by the
attitude variable” (see MABS4IODS, 3.2.1).
I would interpret this to mean that, using the linear model, given attitude, we estimate the expected squared error of the prediction to be roughly 80% of what it would be when simply using the sample mean as a predictor. The estimation assumes the sample is representative, because we’re using residuals to get this number.
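This reading of R-squared can be checked directly from the residuals; a small sketch (not in the original analysis, but it only uses quantities already computed above):

```r
y   <- lrn2014$points
rss <- sum(residuals(attitude_model)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)              # total sum of squares around the mean

rss / tss      # unexplained share of variance, ~0.81
1 - rss / tss  # multiple R-squared, ~0.19 (matches the summary above)
```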
I’m not quite sure if this more elaborate interpretation is exact, but it’s what I was able to piece together from sources online, mostly wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination).
par(mfrow=c(2,2))
plot(attitude_model)
fig. 2.0, regression diagnostics
par(mfrow=c(1,1))
The assumptions of the model are:

1. the constant variance assumption (homoscedasticity)
2. the normality assumption (the residuals are normally distributed)
3. a linear relationship between the predictor and the response
The Residuals vs. Leverage plot checks that there aren’t any influential outliers affecting the fit of the regression coefficients. The plot has a dashed line showing a critical Cook’s distance value; in our case this dashed line is not visible. Essentially, if a point is very far right and has a high standardized residual (off the central line), it is a highly influential point and will have to be looked into.
A highly influential point may be an indication of a point that should be excluded, but this has to be done on a case-by-case basis. It might also mean that assumption 3 is violated, that there isn’t a linear relationship between predictors and the response.
The residuals vs. fitted plot (and the scale-location plot) gives us a visual way to check for heteroscedasticity (violation of assumption 1). If the red mean-line is not horizontal, the residuals have a bias in some region of the response distribution (the variance is not constant).
I don’t think this means the data is heteroscedastic, it certainly
doesn’t look like it is. But I’m not so familiar with these visual
checks, so I searched the web for a homoscedasticity test, and found the
Breusch-Pagan Test in the lmtest library.
attitude_model_bptest <- bptest(attitude_model)
attitude_model_bptest
##
## studentized Breusch-Pagan test
##
## data: attitude_model
## BP = 0.0039163, df = 1, p-value = 0.9501
A p-value of 0.95 means that there is very little evidence that the model is heteroscedastic. Ok, good! The trend would have to be much clearer than that.
The Q-Q plot compares the “ideal” quantiles to the sample quantiles. It is used to test for normality by comparing the residual distribution against a theoretical normal distribution.
There are some outliers at the bottom left, which may indicate a bimodal distribution (remember that the exam scores looked bimodal as well, fig 1.4). The plot also curves down a little at the upper end, which I believe usually indicates a light left skew.
Let’s test the residuals for normality, using the Shapiro-Wilk test:
shapiro.test(attitude_model[["residuals"]])
##
## Shapiro-Wilk normality test
##
## data: attitude_model[["residuals"]]
## W = 0.97394, p-value = 0.003195
The p-value is quite small, 0.003, so we reject the null hypothesis that the residuals are normally distributed.
Hmm, maybe the issue is the grading scale? I’ve tried to fix this, and I can get the Q-Q plot to look nicer with some transformations, but they make the model test statistics worse. This is the best I can do for now.
library(dplyr)
library(readr)
data_url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
df <- read_csv(data_url)
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The variables in the data set are:
colnames(df)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset contains student school performance data averaged over two courses, Portuguese and mathematics (the variables are G1, G2, G3, absences, failures, and paid; to be exact, the paid column is only from one of the courses). The other variables are demographic and social; the social variables were collected using surveys.
Two variables: alc_use and high_use are transformations of the original dataset added for this assignment. alc_use is the average alcohol use per day, combining self-reported weekday and weekend usage on a scale of 1 to 5. high_use is a boolean indicating whether the alc_use variable is more than 2.
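Based on that description, the two derived columns could be reconstructed from the raw weekday (Dalc) and weekend (Walc) consumption columns roughly like this (a sketch of the transformation, not the course’s actual preprocessing script):

```r
library(dplyr)

df %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
         high_use = alc_use > 2)        # TRUE when average use exceeds 2
```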
I choose four variables: Fedu, Medu, absences, and G3, with a hypothesis for each about its relationship with alcohol consumption.
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)
library(ggplot2)
library(GGally)
df %>% select(alc_use, high_use, Fedu, Medu, absences, G3) %>%
ggpairs(aes(fill=high_use, color=high_use))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig. 1.1, correlation matrix of variables, colored by high_use value
Two out of four hypotheses seem to have some support in the data after this exploration.
Let’s do a logistic regression model using the two statistically significant correlations: G3 and absences.
model <- glm(high_use ~ G3 + absences, data = df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3286 -0.8298 -0.7219 1.2113 1.9242
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38732 0.43500 -0.890 0.373259
## G3 -0.07606 0.03553 -2.141 0.032311 *
## absences 0.08423 0.02309 3.648 0.000264 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.93 on 367 degrees of freedom
## AIC: 435.93
##
## Number of Fisher Scoring iterations: 4
Both G3 and absences have a p-value of less than 0.05 for the Wald tests of the fit, but we have a better measure that’s easier to interpret: confidence intervals on the odds ratios from the model.
coeffs <- coefficients(model)
odds_ratio <- coeffs %>% exp
odds_ratio
## (Intercept) G3 absences
## 0.6788751 0.9267616 1.0878762
The odds ratios for G3 and absences are roughly 0.926 and 1.088. For G3, the OR is less than one because the correlation between the variables is negative.

The way to interpret these is:

- for each one-point decrease in final grade average, the odds that a student has high alcohol use increase by roughly 8% (1/0.9267 ~= 1.08)
- for each additional absence per course, the odds that a student has high alcohol use increase by roughly 8.8%
But that’s just the average effect the model has fit, let’s look at confidence intervals in the odds ratio
ci <- confint(model) %>% exp
## Waiting for profiling to be done...
ci
## 2.5 % 97.5 %
## (Intercept) 0.2857593 1.5832545
## G3 0.8639686 0.9935498
## absences 1.0419618 1.1407755
The 95% confidence intervals are both entirely on one side of unity, so we can say with 95% confidence that there is an effect for both variables, and the effect increases the odds by:

- a factor in the range [1.007, 1.16] for each one-point decrease in final grade average (again inverting the odds ratio because the effect is negative)
- a factor in the range [1.04, 1.14] for each additional absence per course

There is an effect, although the final grade effect goes damn near 1 at the low end of the confidence interval: the unit increase in odds is only 0.7%! For absences, the confidence interval is a little tighter, but it still seems a little wide to me for practical use. We would need more samples to get a tighter interval.
The two hypotheses that survived the initial exploration both match the outcomes of the logistic regression, and now we can quantify the quality of an estimate of the effect.
Let’s test out the model with a confusion matrix.
probabilities <- predict(model, type = "response")
df <- df %>%
mutate(probability = probabilities) %>%
mutate(prediction = probability > 0.5)
table(high_use = df$high_use, prediction = df$prediction) %>%
prop.table %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.25675676 0.04324324 0.30000000
## Sum 0.93783784 0.06216216 1.00000000
The confusion matrix tells us that the model has a false positive rate of ~2% and a false negative rate of ~26% (as shares of all students). That’s pretty good! High false negative rates are not so bad; they are just missed cases. A high false positive rate would mean that the model is unreliable and cannot be used as an indicator (in this case; the importance of different error types depends on the specific use case and the meaning of negative and positive).

On the other hand, the confusion matrix also tells us that the model predicts 94% of all students to have non-high alcohol use, while in reality the number is 70%. So the model achieves this relatively safe indicator status by being rather conservative.
We haven’t done a split into a model fit dataset and a validation set, so this confusion matrix is of limited value.
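A minimal train/validation split would look something like this (a sketch, not run as part of the original analysis; the seed is an arbitrary choice for reproducibility):

```r
set.seed(42)  # hypothetical seed

# Hold out 20% of the rows for validation
train_idx <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Refit on the training set only, then evaluate on held-out data
cv_model  <- glm(high_use ~ G3 + absences, data = train, family = "binomial")
test_pred <- predict(cv_model, newdata = test, type = "response") > 0.5

table(high_use = test$high_use, prediction = test_pred)  # held-out confusion matrix
mean(test_pred != test$high_use)                         # held-out misclassification rate
```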
This assignment is about classification and clustering. We’re looking at a dataset put together from census data, and seeing how crime rate varies across multiple correlated variables.
The methods used are linear discriminant analysis (LDA) and k-means clustering.
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 rows with 14 columns. The dataset seems to originally have been put together to analyze housing values across the suburbs. From the paper cited on the stat.ethz.ch site: “Harrison, D. and Rubinfeld, D.L. (1978) Hedonic housing prices and the demand for clean air”, we can find that a row represents a “SMSA census tract”, so, a census area in Boston containing some number of housing units.
The columns contain social statistics related to these census areas
(e.g. crim = crime rate, ptratio =
pupil-teacher ratio), data about the housing units in the area
(rm = avg # of rooms per unit, medv = median
housing unit value, age = prop. houses built before 1940),
and data about the location of the area (e.g. dis =
weighted mean of distances to employment centers, chas = 1
if by Charles River, 0 if not by Charles River).
Some of the columns are a little counter-intuitive or difficult to interpret. E.g., the column age is the proportion of houses built before 1940, and the column lstat is the proportion of the population that is lower status. From the Harrison & Rubinfeld paper, lower status means: “1/2 * (proportion of adults without some high school education and proportion of male workers classified as laborers)”.
Ok, before we move forward, we did see a small issue here, let’s
change chas from an integer to a boolean.
library(dplyr)
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Mode :logical
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 FALSE:471
## Median : 0.25651 Median : 0.00 Median : 9.69 TRUE :35
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Looking at this summary, all columns are definitely not created
equal. I am mostly looking at the difference between the mean and
median, which — if they differ by much — can indicate that a variable is
not normally distributed. Some of the columns are close to normal
distributions, but e.g. zn has a median of 0 and a mean of
11.36. Other highly skewed columns are rad,
crim, tax, chas, and
black.
library(ggplot2)
library(GGally)
Boston_explore %>%
ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Correlation matrix
Ok, lots to unpack here. Let’s start with a visual check of each variable’s distribution (the diagonal). Almost none of the columns look normally distributed, with perhaps the exception of rm, the number of rooms.
There are lots of interesting correlations. Just looking at the scatter plots, the three variables rm, medv, and lstat seem to have strikingly strong relationships with each other, which makes sense to me.
The rad variable, which is an “index of accessibility to radial highways”, is clearly bimodal, or one could even say a split distribution. A subset of areas have a much higher index than the others, and in the scatter plots, the clearly visible line of the higher-index population seems to consistently cover different ranges of the other variable than the lower-index population. The effect is most noticeable in the crim, tax, nox, dis, and black scatter plots.

dis and nox also have a strikingly logarithmic-looking relationship.
In general, nearly every variable seems to be correlated with every
other variable, excepting the chas (area is by the Charles
river) column.
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot
Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.
We see strong correlations (large balls) between:

- dis and (zn, indus, nox, and age)
- tax and rad; this is a very strong correlation, they seem to capture much of the same variation
- tax and (crim, indus, and nox), and ditto for rad
- lstat and medv
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
Let’s run the scale function on the dataset.
Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We’ve now standardized the columns by subtracting the mean and dividing by the standard deviation, so that if a column were normally distributed, it would now follow a standard normal distribution.
# Create quartile class
Boston_scaled$crim <- as.numeric(Boston_scaled$crim)
Boston_scaled$crime <- cut(Boston_scaled$crim, breaks = quantile(Boston_scaled$crim), include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
We’ve split the crime rate column into a categorical variable indicating which quartile of the crime rate distribution each census area falls in.
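A minimal sketch of the quantile-based binning that cut() performs above, on a toy vector rather than the Boston data:

```r
# Toy example of quantile-based binning (hypothetical data, not Boston):
x <- c(0.1, 0.4, 0.9, 1.5, 2.2, 3.8, 5.0, 9.9)
bins <- cut(x, breaks = quantile(x), include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(bins)  # two observations per quartile bin
```

quantile(x) returns the five quartile boundaries (0%, 25%, 50%, 75%, 100%), which cut() turns into four bins; include.lowest = TRUE keeps the minimum value from falling outside the first bin.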
set.seed(179693716)
n <- nrow(Boston_scaled)
split <- sample(n, size = n * 0.8)
train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]
Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.
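A toy sketch of why the negative indexing above gives a proper partition of the rows:

```r
# Toy check: a sample() of row indices and its negation partition the rows.
set.seed(1)
n <- 10
idx <- sample(n, size = n * 0.8)
train_rows <- (1:n)[idx]
test_rows  <- (1:n)[-idx]
intersect(train_rows, test_rows)   # integer(0): the sets are disjoint
sort(c(train_rows, test_rows))     # 1..10: together they cover every row
```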
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
library(MASS) # lda() comes from the MASS package
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2475248 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.85485581 -0.8956178 -0.08120770 -0.8436157 0.4413548 -0.8603410
## med_low -0.09212737 -0.2763238 0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309 0.1791469 0.26643202 0.3710782 0.1054813 0.4474274
## high -0.48724019 1.0170298 -0.04947434 1.0820880 -0.4230980 0.8279971
## dis rad tax ptratio black lstat
## low 0.8290333 -0.6897369 -0.7642825 -0.43374243 0.3887547 -0.77819536
## med_low 0.3902633 -0.5443053 -0.4238660 -0.08385135 0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126 0.1295279 0.06516648
## high -0.8564037 1.6390172 1.5146914 0.78181164 -0.8395683 0.90684062
## medv
## low 0.52171440
## med_low -0.04934231
## med_high 0.14052867
## high -0.73164690
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12991855 0.717286290 -1.11409479
## indus 0.01318562 -0.275547338 0.12479393
## chas -0.08510193 -0.049959856 0.17108503
## nox 0.50033989 -0.751262885 -1.27531579
## rm -0.10015731 -0.108297341 -0.19190232
## age 0.21057791 -0.358696692 -0.14782657
## dis -0.05526417 -0.347261139 0.38293342
## rad 3.16593191 1.032967549 -0.36643042
## tax -0.01168751 -0.096843332 1.10072573
## ptratio 0.12524469 0.004459516 -0.41009220
## black -0.12747580 -0.014860435 0.07573253
## lstat 0.22137355 -0.316638242 0.31488347
## medv 0.17976929 -0.444140572 -0.22238434
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9509 0.0340 0.0151
arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = scale * heads[, choices[1]],
                   y1 = scale * heads[, choices[2]],
                   col = color, length = arrow_heads)
  text(scale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot
We see that of the first two linear discriminants, LD1 nearly
perfectly separates the data into two clusters: those with a high crime
rate, and the rest. rad has clearly the
highest coefficient in LD1, which can be seen both from the biplot and
from the LDA summary.
LD2 seems to find another axis within the data that explains a
smaller effect. The largest coefficients in LD2 belong to
nox, medv, and zn.
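The coefficient magnitudes quoted above can be read straight off the fit; a sketch, assuming lda.fit from the chunk above:

```r
# Rank the discriminant coefficients by absolute size
# (sketch; lda.fit as fitted above, coef() as used in the arrows helper).
sort(abs(coef(lda.fit)[, "LD1"]), decreasing = TRUE)  # rad dominates LD1
sort(abs(coef(lda.fit)[, "LD2"]), decreasing = TRUE)
```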
# Drop the result variables
facit <- test$crime
test <- dplyr::select(test, -crime)
Let’s predict the crime rate quartiles in the test set and cross tabulate:
# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)
# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
## predicted
## correct low med_low med_high high
## low 15 9 0 0
## med_low 5 13 8 0
## med_high 1 10 19 1
## high 0 0 0 21
nrow(test)
## [1] 102
and here’s the same table as a confusion matrix:
image(tab)
fig. 6.1 Confusion matrix of the LDA fit
The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102), the second most common case is being off by one class (33/102), and one prediction is off by two classes (1/102).
68/102 ≈ 67% is not perfect, but it is a lot better than choosing at random, which should give a correct prediction 25% of the time. The model also makes a decent predictor of whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right): that predictor would have a correct classification rate of 101/102, nearly 100%!
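The rates quoted above can be recomputed mechanically from the confusion matrix; a sketch with the table hard-coded from the output above (named conf_tab to avoid shadowing tab):

```r
# Recompute the quoted accuracy figures from the printed confusion matrix.
conf_tab <- matrix(c(15,  9,  0,  0,
                      5, 13,  8,  0,
                      1, 10, 19,  1,
                      0,  0,  0, 21),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                                   predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf_tab)) / sum(conf_tab)                     # 68/102, overall accuracy
sum(conf_tab[abs(row(conf_tab) - col(conf_tab)) == 1])  # 33 off-by-one errors
# Collapsed high vs. non-high classifier:
(sum(conf_tab[1:3, 1:3]) + conf_tab[4, 4]) / sum(conf_tab)  # 101/102
```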
lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit
Using only rad gives us a model that seems to have
exactly the same predictive power in the high vs not
high case, but loses information in the lower quartiles.
This fits with our earlier analysis of how LD1 was mostly
rad and was able to carve out most of the high
crime rate areas from the rest.
The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within cluster sum of squares what a good choice for k would be.
I will take five samples, average them, and plot the standard deviation of the twcss for each k as error bars; this should give us a more reliable plot than taking a single sample.
# determine the number of clusters:
# calculate the total within sum of squares; take 5 samples with different seeds to estimate the variance
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})
df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))
df <- df %>% rowwise() %>% mutate(twcss = mean(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
geom_line() +
geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
theme(legend.position="none") +
scale_x_continuous(breaks=df$k) +
scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot
The plot does seem to agree that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here: the error bars at k = 2 have essentially zero length, indicating that the twcss is constant across runs.
k=3 is also a potential choice, although less clear to me. After that, however, the variance increases greatly and the twcss delta starts giving minimal returns, which indicates that there isn’t a clear structure to the data which would guide the clustering.
I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.
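As an aside, kmeans() has a built-in way to stabilize results without hand-rolled seeds: the nstart argument runs several random initializations per fit and keeps the best one. A sketch of the same elbow curve using it:

```r
# Sketch: a more idiomatic way to stabilize the elbow curve.
# nstart = 20 restarts k-means 20 times per k and keeps the best solution.
twcss_stable <- sapply(1:10, function(k) {
  kmeans(Boston, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, twcss_stable, type = "b", xlab = "k", ylab = "Total WCSS")
```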
Boston_kmeans <- Boston %>% scale %>% as.data.frame
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 28 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k2m$centers
## crim zn indus chas nox rm
## 1 -0.3894158 0.2621323 -0.6146857 0.002908943 -0.5823397 0.2446705
## 2 0.7238295 -0.4872402 1.1425514 -0.005407018 1.0824279 -0.4547830
## age dis rad tax ptratio black lstat
## 1 -0.4331555 0.4540421 -0.5828749 -0.6291043 -0.2943707 0.3282754 -0.4530491
## 2 0.8051309 -0.8439539 1.0834228 1.1693521 0.5471636 -0.6101842 0.8421083
## medv
## 1 0.3532917
## 2 -0.6566834
It seems that even this k = 2 clustering has found the high-crime-rate cluster. Let’s confirm that with a visualization.
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
fig. 7.2 k-means pairs, k = 2
The model has picked two clusters with clearly different profiles: a blue cluster with a lot of business activity (high tax rate, close to employment centers), access to arterial highways, and high-density building (lower number of rooms, zoned for smaller plots), and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes, more working-class or non-educated people, and a much higher proportion of black residents.
So it seems like the red regions are newer developments, new suburbs
farther away from the city, and there may be some price discrimination
in the house prices (medv) connected with the high
proportion of black residents living there. You can see the effects of
segregationist US housing policy in the data. E.g., the 1949 Housing Act
set up a framework to subsidize public housing for whites with clauses
forbidding resale to black people, which meant that black people paid
more for housing (see e.g. Abramovitz & Smith, “The Persistence
of Residential Segregation by Race, 1940 to 2010: The Role of Federal
Housing Policy”, Families in Society, Vol. 102, Issue 1, for
more).
Why not try with k=3 for good measure? Maybe we can find additional structure in the data.
set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k3m$centers
## crim zn indus chas nox rm age
## 1 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456 0.1673019
## 3 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873 0.7828949
## dis rad tax ptratio black lstat
## 1 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397 -0.05818052
## 3 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974 0.90799414
## medv
## 1 0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
fig. 7.3 k-means pairs, k = 3
Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.
The differences between the red and green clusters seem to be:
- the red cluster has higher values of zn (more big plots)
- the red cluster has a smaller proportion of older buildings
- the red cluster consists exclusively of black neighbourhoods, while the green cluster has some spread
- the green cluster is between the red and blue clusters in its proportion of laborers and uneducated adults
Maybe the red cluster matches better with those black neighbourhoods built more recently, to which the 1949 Housing Act and the Federal Housing Administration regulations applied? I don’t know for sure; more analysis would be required.
Actually, a fairly large selection of statistical methods can be listed under the title “dimensionality reduction techniques”. Most often (nearly always, that is!) the real-world phenomena are multidimensional: they may consist of not just two or three but 5 or 10 or 20 or 50 (or more) dimensions. Of course, we are living only in a three-dimensional (3D) world, so those multiple dimensions may really challenge our imagination. It would be easier to reduce the number of dimensions in one way or another.
We shall now learn the basics of two data science based ways of reducing the dimensions. The principal method here is principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that collect together as much variance as possible from the original variables. The most important components can be then used for various purposes, e.g., drawing scatterplots and other fancy graphs that would be quite impossible to achieve with the original variables and too many dimensions.
Multiple correspondence analysis (MCA) and other variations of CA bring us similar possibilities in the world of discrete variables, even nominal scale (classified) variables, by finding a suitable transformation into continuous scales and then reducing the dimensions quite analogously with the PCA. The typical graphs show the original classes of the discrete variables on the same “map”, making it possible to reveal connections (correspondences) between different things that would be quite impossible to see from the corresponding cross tables (too many numbers!).
Briefly stated, these methods help to visualize and understand multidimensional phenomena by reducing their dimensionality that may first feel impossible to handle at all.
library(readr)
library(tibble)
library(GGally)
library(corrplot)
library(dplyr)
library(FactoMineR)
human0 <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)
Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
human <- column_to_rownames(human0, "Country")
Let’s look at the data, starting with some descriptive statistics:
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Based on the difference between mean and median, and the ranges of
the variables, Edu.Exp and Parli.F seem the
most normally distributed. Edu2.FM, Labo.FM
and Life.Exp are more skewed or have longer tails.
GNI and Mat.Mor are closer to log-normal (or a
related distribution), and maybe Ado.Birth as well.
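These impressions can be checked numerically; a sketch using the textbook third-moment skewness formula directly (to avoid pulling in an extra package), applied to the human data loaded above:

```r
# Quick numeric skewness check (sketch; `human` as prepared above).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(sapply(human, skew), 2)
# Strong positive skew for GNI and Mat.Mor would support
# the log-normal impression; values near 0 suggest symmetry.
```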
corrplot(cor(human), method="circle",type = "upper", tl.pos = "d")
fig. 1.1, correlation matrix of variables
There are definitely strong correlations in the data. We can identify three different groups of variables from this matrix.
First, there is one set of very strongly correlated
variables: Mat.Mor, Ado.Birth,
Life.Exp, and Edu.Exp. These will likely be
very strongly represented in the first principal component of a PCA.
Second, these four variables also correlate very strongly with
GNI and Edu2.FM, more strongly than that pair
of variables correlate with each other. Since GNI and Edu2.FM correlate less
between themselves, there seem to be more degrees of freedom here.
Then there is a third group, Labo.FM and
Parli.F. They seem much less correlated with the
other variables; however, they are most strongly correlated with
each other. Again, there seem to be more degrees of freedom in this group.
ggpairs(human, progress = FALSE)
fig. 1.2, distributions and scatterplots of variables
The correlations become much more clear when looking at the
scatterplots. Some are clearly linear, like Edu.Exp and
Life.Exp. It’s even clearer that GNI might
need to be log-transformed, based on the distribution and the scatter
plot shapes. Possibly Mat.Mor and Ado.Birth
too.
The other distributions look like skewed normal-like distributions, all of them have one big main mode, although there are some interesting concentrations in certain parts of the distributions.
E.g., in Life.Exp, there is a clear plateau from 60 to
roughly 65 or so. This may be related to quality of life before and
after retirement.
Another example is this curious smaller mode in the
Edu2.FM around 0.5, and a subsequent drop after that. This
suggests that there are two overlaid populations, one where the mode is
a little under 1, another where the mode is around 0.5. I don’t have a
hypothesis for what could be causing this difference in behaviors in the
two populations of countries, but it’s a very interesting feature of the
distribution. Unfortunately we’ve already thrown away the two variables
that this variable is based on, which may have given us some clues to
what this effect is.
human.log <- human %>% mutate(GNI = log(GNI)) %>% mutate(Mat.Mor = log(Mat.Mor)) %>% mutate(Ado.Birth = log(Ado.Birth))
ggpairs(human.log, progress = FALSE)
fig. 1.3, distributions and scatterplots of variables after log transform
The scatter plots now seem to form much clearer trend lines, which seems to indicate this was a good idea. Let’s recheck the correlation matrix, just to check that the scale of the data didn’t mix things up too much.
corrplot(cor(human.log), method="circle",type = "upper", tl.pos = "d")
fig. 1.4, correlation matrix using log transformed dataset
Well, this does seem to have changed a lot of things. The
correlations are now much more even among five variables:
Life.Exp, Edu.Exp, GNI,
Mat.Mor, and Ado.Birth. The correlation with
Edu2.FM is a little weaker, and the Parli.F
and Labo.FM variables are again least correlated with the
others.
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate on the x-axis, PC2 coordinate on the y-axis), along with arrows representing the original variables. (0-2 points)
Since the log transform was not part of the assignment, let’s forget it for now, do PCA on the raw dataset, and look at the coefficients.
pca_raw <- prcomp(human)
pca_raw
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
PC1 is nearly all GNI, PC2 is nearly all Mat.Mor, and PC3 is nearly
all Ado.Birth, etc. This is a problem. Let’s see why by looking at the
summary of the transformed rows, which are in
pca_raw$x.
summary(pca_raw$x)
## PC1 PC2 PC3 PC4
## Min. :-105495 Min. :-858.85 Min. :-87.306 Min. :-37.798
## 1st Qu.: -6885 1st Qu.: -75.30 1st Qu.:-11.498 1st Qu.: -6.643
## Median : 5587 Median : 60.76 Median : 2.681 Median : 1.787
## Mean : 0 Mean : 0.00 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 13430 3rd Qu.: 117.75 3rd Qu.: 13.592 3rd Qu.: 8.200
## Max. : 17051 Max. : 200.85 Max. : 99.552 Max. : 26.267
## PC5 PC6 PC7 PC8
## Min. :-16.0452 Min. :-4.38647 Min. :-0.70772 Min. :-0.43774
## 1st Qu.: -2.0796 1st Qu.:-1.10824 1st Qu.:-0.08793 1st Qu.:-0.08919
## Median : 0.5794 Median : 0.07198 Median : 0.02775 Median :-0.01657
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 2.3562 3rd Qu.: 1.09716 3rd Qu.: 0.10801 3rd Qu.: 0.08364
## Max. : 7.6214 Max. : 3.80771 Max. : 0.80521 Max. : 0.50052
As we saw earlier, GNI per capita goes from roughly 500 to 100000, while the other variables were all max in the hundreds. Since PCA will maximize the spread along each principal component, we will get some bad decisions, because the distance scales in these variables are not comparable (the dataset needs to be standardized).
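The imbalance is easy to see directly: PCA on unstandardized data diagonalizes the covariance matrix, so whichever variable has the largest raw variance dominates the first component. A sketch:

```r
# Sketch: GNI's raw variance dwarfs every other variable's,
# which is why PC1 of the unstandardized PCA is essentially GNI.
round(sapply(human, var), 1)
# Standardizing first (scale()) makes PCA work on the correlation
# structure instead of the raw covariances.
```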
If we look at a summary of the pca, this is even more clear:
summary(pca_raw)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
This tells us that PC1 captures nearly all of the variance (>0.9999) in the dataset, which again is due to the dataset not being standardized.
Knowing this, we can’t expect a very good biplot, but let’s plot one anyway.
pca_pr <- round(100*summary(pca_raw)$importance[2, ], digits = 2)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_raw, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("Negative GNI,", labels[1]), ylab = paste("Maternal mortality,",labels[2]))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
fig. 2.1, biplot for PCA of the raw data
Again, we see that PC1 is pretty much only GNI. This is evident from the fact that the GNI arrow is the only visible one. This can also be inferred from looking at the order of the countries, high GNI countries are to the left, low GNI countries to the right.
I’ve named the principal components according to the variable that they’ve picked up from the dataset (negative GNI and Maternal Mortality), since each of the first few components are almost aligned with one dimension.
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
Let’s standardize, do a PCA, and look at the principal components.
pca_scaled <- human %>% scale %>% as.data.frame %>% prcomp
pca_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Life.Exp -0.42242796 -0.43406432 0.62737008
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
This already looks much better. Let’s see the spreads.
summary(pca_scaled$x)
## PC1 PC2 PC3 PC4
## Min. :-3.4128 Min. :-2.79379 Min. :-2.13797 Min. :-3.43435
## 1st Qu.:-1.4651 1st Qu.:-0.61041 1st Qu.:-0.61625 1st Qu.:-0.51644
## Median :-0.4934 Median : 0.04166 Median :-0.07731 Median : 0.05376
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0237 3rd Qu.: 0.68895 3rd Qu.: 0.47541 3rd Qu.: 0.54936
## Max. : 6.0717 Max. : 3.12552 Max. : 2.56743 Max. : 2.27574
## PC5 PC6 PC7 PC8
## Min. :-2.74409 Min. :-1.49002 Min. :-0.98931 Min. :-1.0576
## 1st Qu.:-0.33133 1st Qu.:-0.34286 1st Qu.:-0.29823 1st Qu.:-0.1581
## Median : 0.06288 Median : 0.02236 Median :-0.02144 Median : 0.0261
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.41848 3rd Qu.: 0.26156 3rd Qu.: 0.29026 3rd Qu.: 0.1657
## Max. : 1.54550 Max. : 1.82764 Max. : 1.13872 Max. : 1.3844
Ok, the range of the dimensions seem much more sensible now, they are all in the same order of magnitude. Let’s see how much variance each principal component explains.
summary(pca_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
PC1 explains more than half the variance, not bad! However, every principal component still captures at least one percent of the variance, so we will be losing some information if we cut the tail off.
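A scree plot makes this trade-off visible; a sketch, assuming pca_scaled from above:

```r
# Sketch: scree plot of the proportion of variance per component.
# Row 2 of summary(prcomp)$importance is "Proportion of Variance".
prop_var <- summary(pca_scaled)$importance[2, ]
plot(prop_var, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance")
```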
pca_pr <- round(100*summary(pca_scaled)$importance[2, ], digits = 2)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("neg. Human development,", labels[1]), ylab = paste("Gender equality,",labels[2]))
fig. 3.1, biplot for PCA of the scaled data
Much better, and better yet, all original variables are roughly axis-aligned. I’ve added descriptive labels based on which variables align with which axes, more on these in section 4.
As already discussed, the raw data was a bad fit for PCA due to the different orders of magnitude in the dispersion among the variables. PCA on raw data essentially just picked out one variable at a time in decreasing order of scale.
PCA on the scaled data performs much better.
I’ve decided to name PC1 neg. Human development,
inspired by the name of the original dataset. This principal component
measures the welfare of the country in terms of health (Mat.Mor,
Ado.Birth, Life.Exp), standard of living (GNI per cap., ppp adjusted),
and education (Edu.Exp, Edu2.FM). The value of this component is smaller
with better outcomes in these domains, which is where the
neg. comes in. I would rather take the negative of this PC1
and call it Human development, but this is what the PCA
gave us.
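Flipping the sign would be legitimate: principal component directions are only defined up to sign, so negating both the rotation column and the scores leaves the decomposition intact. A sketch:

```r
# Sketch: negate PC1 so that higher values mean higher human development.
pca_flipped <- pca_scaled
pca_flipped$rotation[, 1] <- -pca_flipped$rotation[, 1]
pca_flipped$x[, 1]        <- -pca_flipped$x[, 1]
# The paired sign flip cancels out in the reconstruction
# x %*% t(rotation), so nothing else changes.
```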
PC2 I’ve called gender equality, because it measures female participation in political decision-making (0.65 * Parli.F) and female participation in the labour market (0.72 * Labo.FM).
This is perhaps not the whole story, because this component also has a positive contribution from maternal mortality and adolescent birth rates. Maybe this component only measures whether society has moved away from traditional gender roles. In that case, it can be seen as a cultural liberal/conservative axis.
PC3 is positively correlated with female participation in political decision-making (0.73 * Parli.F), but negatively correlated with the ratio of female participation in labor markets and secondary education (-0.584 * Labo.FM, -0.24 * Edu2.FM).
Since PC3 loads positively on Parli.F but negatively on
Labo.FM and Edu2.FM, a country with a high level of female
MPs relative to female education and participation in the labor
market gets a high PC3 score.
So this principal component measures how women are viewed in society. Are they elected into parliament as leaders, but less present in the workforce? Then PC3 is high. Are they seen as useful in the workforce but not as leaders? Then PC3 is low.
The principal components are getting harder and harder to reason about as we go further down the list.
Roughly,
PC4 = 0.62 Edu2.FM - 0.72 GNI - 0.25 Mat.Mor.
These variables don’t tell a clear story. Edu2.FM should be very close to 1 for all developed nations, since secondary education is near-universal there, and together with the negative loading on GNI per capita, maybe this axis is about the economic development of the country? The negative loading on the maternal mortality rate is difficult to square with this.
This component might measure the relative focus on wealth as compared to other societal welfare in the country. With heavy emphasis on might.
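The rough loadings quoted above can be read straight off the rotation matrix; a sketch, assuming `pca_scaled` is the `prcomp` fit on the scaled data:

```r
# Loadings of PC4 on the original (scaled) variables, rounded for readability
round(pca_scaled$rotation[, "PC4"], 2)
```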
human_scaled_pca <- data.frame(pca_scaled$x)
# Countries sorted by PC4, highest first
human_scaled_pca %>% arrange(desc(PC4))
## PC1 PC2 PC3
## Myanmar 0.22129643 -0.339858349 -2.137967976
## Gabon 0.38951619 0.539298671 -1.477157917
## Guyana 0.81614017 -0.005126477 0.855888106
## Libya -1.53980422 -1.493748457 0.207781921
## Tajikistan 0.21994425 -0.299649920 -0.471229436
## Lesotho 2.41352165 0.966326284 -0.469152353
## Honduras 0.37144496 -0.477332449 0.893880234
## Nicaragua 0.33842550 0.570944859 1.600126090
## Jamaica -0.06175153 0.015692724 -0.477625887
## Samoa -0.49340583 -2.067158649 -0.048787759
## Moldova (Republic of) -0.14463141 0.316328188 -0.310093407
## Kyrgyzstan -0.09841728 -0.010254730 0.233419334
## Philippines 0.25460237 0.009222189 0.528877460
## Belize 0.03821611 -0.834020935 -0.164341793
## Armenia -0.41497907 -0.613047804 -0.668870242
## Bhutan 0.21621916 -0.205695481 -1.274792504
## Rwanda 1.08474351 3.125519996 1.399402307
## Dominican Republic 0.15665142 -0.232069047 -0.005138172
## Namibia 0.58896079 1.429030129 0.561863707
## Georgia -0.38211425 -0.445864707 -0.649432206
## Tonga -0.45134514 -1.165949078 -1.371247180
## Ecuador -0.48514379 1.062937942 1.432626667
## Mongolia -0.58002389 0.031790917 -0.819700835
## Venezuela (Bolivarian Republic of) -0.35614736 -0.328185500 -0.262447337
## Fiji -0.54064402 -1.018100867 0.043783998
## Colombia -0.24456731 -0.012229219 -0.017543050
## Ukraine -0.61256414 -0.236511138 -0.828004404
## Guatemala 0.92978002 -1.021873670 0.076300346
## Barbados -0.83161287 0.528012907 -0.591747431
## Cuba -0.85690790 1.206502306 2.170553442
## Brazil -0.52713454 -0.437211056 -0.914408365
## South Africa 0.52541319 1.290788972 0.929186725
## Costa Rica -0.88088858 0.247783224 1.177525013
## Panama -0.52104983 -0.445966480 0.109079216
## Viet Nam -0.12772727 0.635409753 -0.023983110
## Albania -0.67704042 -0.334378287 0.233899467
## Uruguay -1.10762421 -0.407424570 -0.791498107
## Bangladesh 1.17163475 -0.264831864 0.284988061
## Azerbaijan -0.13345597 0.270223921 -0.917559403
## Sri Lanka -0.75648305 -1.887423148 -0.154346306
## Bulgaria -0.83448409 0.297191795 -0.354609758
## El Salvador 0.31502973 -0.082434305 0.932630500
## Bahamas -0.80193053 0.256043765 -0.862499787
## Paraguay 0.44428803 -0.500263835 0.076871476
## Swaziland 2.33315345 -0.596949105 -0.306431397
## Bolivia (Plurinational State of) 0.61863947 2.118079004 1.809884477
## Congo 2.50890521 0.529214816 -1.324187811
## Belarus -1.09207196 0.821722632 0.249889580
## Kenya 2.23833543 0.681110048 -0.446481460
## Botswana 0.51070846 -0.005038506 -1.402977382
## Thailand -0.32856107 -0.610848412 -1.075159434
## Peru -0.08991745 0.379961425 -0.057040060
## Zimbabwe 2.24835708 1.728001429 0.178907487
## Mexico -0.53292222 0.345557424 1.529477321
## Kazakhstan -0.75696807 0.549022028 -0.685887511
## Uganda 2.95628470 1.805866347 0.326356280
## Suriname -0.12432505 -0.997314985 -0.287838649
## Latvia -1.31130851 0.187341613 -0.642976519
## Montenegro -1.11686706 -0.122488486 -0.280938313
## Romania -0.77188498 -0.403249223 -0.702527677
## Jordan -0.57591739 -2.362089025 0.910026604
## Syrian Arab Republic 0.36377295 -2.559738024 1.335263924
## Russian Federation -0.80811646 -0.054778581 -0.829752646
## Lebanon -1.23268082 -2.526681094 0.043940503
## China -0.67672686 0.359953468 0.049521929
## Maldives -0.67904281 -0.989578003 -0.802207742
## Hungary -1.36809208 -0.493681621 -0.942204385
## Chile -1.22310783 -0.447088566 -0.163992357
## Serbia -0.83377077 0.705073607 0.951750046
## Iran (Islamic Republic of) -0.96436342 -2.793788341 0.315434603
## Indonesia 0.48036817 -0.567596783 0.116329285
## Croatia -1.30589585 0.406709990 0.159833574
## Estonia -1.68580633 0.371211438 -0.569235608
## Slovakia -1.48297697 -0.019596892 -0.401935798
## Argentina -1.45601757 0.838984586 1.037680344
## Trinidad and Tobago -0.43657185 0.131125955 0.074004863
## Portugal -1.99792574 1.040595945 0.180046605
## Ghana 2.01345921 0.355485039 -1.241712687
## Lithuania -1.47424792 0.607825365 -0.365058067
## Algeria -0.88489346 -1.630270510 1.842073337
## Czech Republic -1.94024133 0.041663314 -0.421174852
## Poland -1.51988207 0.193167920 -0.111253539
## Malaysia -0.92235926 -0.984616060 -0.085310083
## The former Yugoslav Republic of Macedonia -0.62281995 0.263132896 1.292063415
## Haiti 2.36064900 -0.561384801 -1.290302990
## Sudan 2.60420748 -1.045381397 1.334927961
## Mauritius -0.79068829 -0.937216376 -0.221835304
## Tanzania (United Republic of) 2.88648521 1.899305596 0.487200420
## Slovenia -2.16450489 0.836704528 -0.077310885
## Egypt 0.07863364 -2.585345419 0.217108451
## Morocco 0.36411731 -2.047307836 0.794721315
## Zambia 2.37579379 0.275956183 -0.731488256
## Papua New Guinea 2.24414173 -0.288614274 -1.445469022
## Tunisia -0.84854176 -0.830766824 1.960056511
## Gambia 3.41276063 -0.007016863 -0.901829512
## Cyprus -1.52126616 -0.301457929 -0.810134452
## Bosnia and Herzegovina -0.49234291 -0.706061064 0.646486710
## Israel -2.08527706 0.550448154 -0.402495966
## Bahrain -1.83627456 -1.277276062 -0.005796514
## Greece -1.99151909 0.052754647 -0.028068250
## New Zealand -2.49492486 1.269102839 0.014971514
## Cameroon 3.50119915 1.065163090 0.063364397
## Malawi 3.49174202 1.207150760 -0.936231767
## Nepal 1.43508146 1.194786495 0.463613461
## Burundi 2.92716442 2.104937966 -0.100589338
## Iceland -2.74910896 2.053637043 0.419653402
## Senegal 2.60628829 1.250222087 1.794207914
## Malta -1.44928760 -1.028809545 -0.081348924
## Japan -2.20169826 -0.609972749 -0.751252991
## Finland -2.48617797 1.892468888 0.594038932
## Cambodia 1.49861802 0.434332761 -0.135571575
## Mali 4.42864302 -0.725923093 -0.189211918
## Spain -2.33209334 1.348636610 0.700913469
## United Kingdom -2.05713992 0.589584266 -0.407850345
## Ireland -2.62808538 0.362020337 -0.640755797
## Benin 2.85453769 -0.106611646 -0.837692019
## Ethiopia 2.92275146 0.800047424 0.272497038
## Canada -2.26302904 1.033411190 -0.292957486
## Italy -2.18260449 0.338893545 0.646989575
## Iraq 0.96258010 -1.699941581 2.252525442
## France -2.19885491 0.696790265 -0.217303004
## Turkey -0.57530772 -1.527893514 0.722212692
## Austria -2.38403803 0.902207024 -0.003026204
## India 1.10413681 -1.992805576 1.003863214
## Pakistan 1.72263666 -1.938459968 1.749897520
## Togo 2.92936876 1.008638618 -0.699996373
## Germany -2.51325622 1.320373244 0.388898222
## Australia -2.99370191 1.188201602 -0.106765315
## Sweden -2.51286376 1.963931278 0.595735063
## Belgium -2.35585306 1.588259556 0.847919087
## Denmark -2.78120158 1.773429808 0.171302280
## Korea (Republic of) -2.10167572 -0.277374746 -0.352782824
## Mauritania 2.60820181 -1.330052714 1.623557480
## United Arab Emirates -2.14191598 -0.901306523 -0.292906252
## Oman -1.29017323 -2.054918869 0.293468342
## Netherlands -2.72716147 1.469105166 0.308130817
## Côte d'Ivoire 4.64084255 -0.598211941 -0.459138696
## Burkina Faso 3.94333744 0.053955410 -0.319270357
## Liberia 3.96747850 0.317838331 -0.863492996
## Congo (Democratic Republic of the) 4.43880394 0.539558045 -1.303548841
## Mozambique 4.24412061 2.357683071 0.761360284
## Yemen 2.36569153 -2.652557169 0.382761370
## Niger 5.39088099 -1.241241283 0.873413438
## United States -2.15666301 0.461289598 -0.876464679
## Central African Republic 5.15979806 0.299745466 -0.827624027
## Sierra Leone 5.29969299 0.894599149 -1.383380245
## Switzerland -2.69683318 0.896899989 -0.275048181
## Luxembourg -2.34430026 0.673760445 -0.178640850
## Afghanistan 3.17249375 -1.530064615 2.567425390
## Norway -3.08378347 1.890281525 0.056969733
## Saudi Arabia -2.04814861 -1.613019142 0.881908713
## Chad 6.07170782 0.419662712 -0.379782202
## Singapore -2.86994888 0.516674990 -0.429116560
## Kuwait -2.34123467 -1.651753979 -1.483382891
## Qatar -3.41275720 -1.651554785 -2.092935891
## PC4 PC5
## Myanmar 2.275737391 -0.784101836
## Gabon 1.778909476 -2.181779590
## Guyana 1.490657114 -1.914978588
## Libya 1.311190236 -0.777481767
## Tajikistan 1.280122465 0.062883028
## Lesotho 1.226549015 -1.630682971
## Honduras 1.206509214 -1.014293478
## Nicaragua 1.100369752 -1.082146297
## Jamaica 1.044404927 -0.227410096
## Samoa 1.016812799 -0.110042154
## Moldova (Republic of) 0.994627115 0.470592772
## Kyrgyzstan 0.984423355 0.253963473
## Philippines 0.943707745 -0.508529110
## Belize 0.905689903 -0.354976024
## Armenia 0.845581564 0.440952425
## Bhutan 0.843768126 0.344669908
## Rwanda 0.823466529 -0.094488185
## Dominican Republic 0.820699804 -0.898427787
## Namibia 0.814718726 -0.404258828
## Georgia 0.800279490 0.377801426
## Tonga 0.781942289 0.749978939
## Ecuador 0.768903920 -0.699510243
## Mongolia 0.758449744 0.440072188
## Venezuela (Bolivarian Republic of) 0.758276118 -0.904889663
## Fiji 0.751446198 0.004254590
## Colombia 0.739148454 -0.404110156
## Ukraine 0.735815695 0.683855615
## Guatemala 0.732023267 -0.737883028
## Barbados 0.716841534 0.216873945
## Cuba 0.652802978 -0.141187770
## Brazil 0.635546781 -0.242856529
## South Africa 0.625719759 -0.626998851
## Costa Rica 0.612051066 -0.448031363
## Panama 0.606356209 -0.882843754
## Viet Nam 0.603302121 0.872257041
## Albania 0.600494707 0.497187324
## Uruguay 0.599660130 -0.231245152
## Bangladesh 0.581364674 -0.152520365
## Azerbaijan 0.560060281 0.186188142
## Sri Lanka 0.538663748 0.378862799
## Bulgaria 0.529608478 0.276207632
## El Salvador 0.498582285 -0.209535347
## Bahamas 0.488197758 0.057513980
## Paraguay 0.471807391 0.004514533
## Swaziland 0.453557485 -0.664998668
## Bolivia (Plurinational State of) 0.399174576 -0.250247599
## Congo 0.398128908 -0.669796656
## Belarus 0.394723997 0.361435870
## Kenya 0.390264593 -0.349458967
## Botswana 0.355643457 0.015774383
## Thailand 0.347688665 0.621808637
## Peru 0.346459243 0.357781868
## Zimbabwe 0.329045089 -0.135099353
## Mexico 0.313139790 -0.529001350
## Kazakhstan 0.308198625 0.224421990
## Uganda 0.305231215 -0.547035571
## Suriname 0.298543025 -0.101737220
## Latvia 0.278702422 0.357077524
## Montenegro 0.271709792 0.787994084
## Romania 0.270762447 0.358282124
## Jordan 0.239258300 -0.030840898
## Syrian Arab Republic 0.236685678 0.186780574
## Russian Federation 0.230843800 0.198560713
## Lebanon 0.219929309 0.207450407
## China 0.207098455 0.948090559
## Maldives 0.204993662 1.087492380
## Hungary 0.190542428 0.422324617
## Chile 0.151604505 -0.029232073
## Serbia 0.150877958 0.741356169
## Iran (Islamic Republic of) 0.145504912 -0.038590210
## Indonesia 0.140336726 0.123615819
## Croatia 0.137752539 0.539931997
## Estonia 0.125625669 0.348079365
## Slovakia 0.108807736 0.184357318
## Argentina 0.100541901 -0.382426961
## Trinidad and Tobago 0.094363327 -0.478807410
## Portugal 0.070353716 0.343488087
## Ghana 0.053762696 0.570568434
## Lithuania 0.038625393 0.469887414
## Algeria 0.036027293 -0.035468371
## Czech Republic 0.034533898 0.396983608
## Poland 0.028293784 0.478839786
## Malaysia -0.007775677 0.250309695
## The former Yugoslav Republic of Macedonia -0.011789161 0.757393781
## Haiti -0.018017361 0.819230616
## Sudan -0.030656873 -0.712459363
## Mauritius -0.053895731 0.417082038
## Tanzania (United Republic of) -0.057669838 -0.228794202
## Slovenia -0.070286122 0.519533942
## Egypt -0.084109671 0.384246559
## Morocco -0.091930219 0.468826463
## Zambia -0.098419324 0.125849595
## Papua New Guinea -0.125337620 1.185766443
## Tunisia -0.167303534 0.654406033
## Gambia -0.172392139 0.044491812
## Cyprus -0.181589001 0.539000803
## Bosnia and Herzegovina -0.182925626 1.209146178
## Israel -0.191614837 0.468282001
## Bahrain -0.214807455 -0.784249226
## Greece -0.224756734 0.689741659
## New Zealand -0.241859264 0.196214374
## Cameroon -0.260759929 -0.555844104
## Malawi -0.308982421 0.054072818
## Nepal -0.316593826 1.114794397
## Burundi -0.323059706 0.419879197
## Iceland -0.327005382 0.264383077
## Senegal -0.333524119 0.042331771
## Malta -0.354324825 0.284860129
## Japan -0.370785436 0.142433142
## Finland -0.402199400 -0.035148222
## Cambodia -0.403276572 1.545495585
## Mali -0.407128283 -1.027734140
## Spain -0.408851517 0.350774791
## United Kingdom -0.430565377 -0.121151666
## Ireland -0.474385756 0.158551905
## Benin -0.510128117 0.862388776
## Ethiopia -0.522758647 0.638178331
## Canada -0.536756869 -0.054260003
## Italy -0.553844809 0.335992609
## Iraq -0.565370695 -0.287911201
## France -0.570782466 0.293158744
## Turkey -0.590951238 0.618018515
## Austria -0.615634290 -0.107843670
## India -0.628896340 0.852767619
## Pakistan -0.633197073 0.676731900
## Togo -0.644239048 0.928867228
## Germany -0.648265885 -0.140054500
## Australia -0.667575221 0.146603167
## Sweden -0.672261525 -0.206796715
## Belgium -0.676265477 -0.050602376
## Denmark -0.683471922 0.056081866
## Korea (Republic of) -0.691151610 0.615032289
## Mauritania -0.732229396 0.149028861
## United Arab Emirates -0.773527981 -1.767659249
## Oman -0.795461293 -0.057279210
## Netherlands -0.797993542 -0.024473662
## Côte d'Ivoire -0.803721453 -0.656805429
## Burkina Faso -0.808281584 0.585365213
## Liberia -0.821010099 0.337292077
## Congo (Democratic Republic of the) -0.866980658 0.149837925
## Mozambique -0.926674225 0.313358992
## Yemen -0.949498266 1.002500642
## Niger -0.965501357 -1.303263726
## United States -0.972954474 -0.523731016
## Central African Republic -1.039430441 -0.027250931
## Sierra Leone -1.136109156 -0.260135837
## Switzerland -1.166378061 -0.293310010
## Luxembourg -1.167750947 -0.659139954
## Afghanistan -1.307267500 0.171076164
## Norway -1.429085734 -0.634591949
## Saudi Arabia -1.483999840 -0.771322474
## Chad -1.710289949 -0.313209384
## Singapore -2.145302064 -0.773851494
## Kuwait -2.268698795 -1.459482728
## Qatar -3.434350558 -2.744086530
## PC6 PC7
## Myanmar 1.7676007033 -0.806312423
## Gabon -0.1202364303 0.236755731
## Guyana 0.3639403210 -0.027849786
## Libya 0.9330807855 0.570157750
## Tajikistan 0.3475875471 -0.503606390
## Lesotho 1.0758622144 1.138717611
## Honduras -0.3160976543 -0.377880488
## Nicaragua -0.6811130040 -0.518838884
## Jamaica -0.5596006430 -0.552066779
## Samoa 0.1862354772 0.197408492
## Moldova (Republic of) 0.3856501921 -0.575902862
## Kyrgyzstan 0.4470455299 -0.082545945
## Philippines 0.5735628979 -0.188977457
## Belize -0.5257968357 0.220621752
## Armenia 0.1512678759 -0.513028090
## Bhutan 0.1354618087 -0.160322615
## Rwanda 1.4882291993 -0.291101527
## Dominican Republic -0.9977964148 -0.136457443
## Namibia 0.6531061286 -0.279270660
## Georgia -0.4329675048 -0.129134324
## Tonga 0.0223647898 0.463117389
## Ecuador -0.6229513015 0.060219694
## Mongolia 0.3664312347 0.382144832
## Venezuela (Bolivarian Republic of) -0.8202034504 0.165151944
## Fiji -0.2586706159 0.954688210
## Colombia -0.5193595878 -0.089128360
## Ukraine -0.0402324586 0.377186088
## Guatemala -0.6663882061 -0.519546181
## Barbados -0.5554976971 0.143179329
## Cuba -0.1046529231 -0.129211242
## Brazil -0.9387997247 0.259841102
## South Africa 0.8641957710 0.880893184
## Costa Rica -0.5801338772 -0.205490683
## Panama -0.7794109136 -0.261632639
## Viet Nam 0.1370254776 -0.819669561
## Albania 0.3719985522 -0.758087155
## Uruguay -0.8520051813 0.123425781
## Bangladesh -0.2801004193 -0.761911579
## Azerbaijan 0.2228914450 -0.691575682
## Sri Lanka 0.1134797992 0.181878318
## Bulgaria -0.1548721090 -0.094755523
## El Salvador -0.5208362283 -0.298642007
## Bahamas 0.2221208500 -0.684132379
## Paraguay -0.3868646873 -0.383448677
## Swaziland 0.9379950756 1.083913498
## Bolivia (Plurinational State of) -0.0005896796 0.218008164
## Congo -0.7300608217 0.035047723
## Belarus 0.2002805861 0.442680455
## Kenya 0.0003908410 0.192825515
## Botswana 0.4208028626 0.051944196
## Thailand -0.3728835623 -0.321836338
## Peru -0.2954092927 -0.346454974
## Zimbabwe 1.0326915615 0.484252449
## Mexico -0.3750194353 -0.284486795
## Kazakhstan 0.1478071951 0.258894510
## Uganda -0.2663511002 -0.222926610
## Suriname 0.2671143004 0.054969268
## Latvia 0.1885997063 0.131359913
## Montenegro -0.0735326488 0.115347703
## Romania -0.1404008807 -0.081718774
## Jordan 0.0960728327 0.426350376
## Syrian Arab Republic 0.0533687969 0.381438615
## Russian Federation 0.1868210793 0.214125170
## Lebanon -0.0082757639 0.066977961
## China 0.3732800063 -0.461876446
## Maldives 0.2331259089 -0.434462712
## Hungary 0.0530148587 0.202738900
## Chile -0.9916455370 -0.141135886
## Serbia 0.1960956378 0.001140764
## Iran (Islamic Republic of) -0.3591669622 0.722852945
## Indonesia 0.0697247267 0.318509670
## Croatia 0.1183552755 -0.081785909
## Estonia -0.1433648078 0.319266420
## Slovakia 0.0857885836 0.030612783
## Argentina -0.7562700661 0.998335318
## Trinidad and Tobago 0.5483599732 -0.287201887
## Portugal -0.1179107107 0.036420328
## Ghana 0.3721749488 0.232143202
## Lithuania 0.1583897022 0.463060174
## Algeria 0.4902133796 0.630600521
## Czech Republic 0.0054545873 0.274517961
## Poland 0.0146512625 0.078155353
## Malaysia 0.5943478473 -0.310882005
## The former Yugoslav Republic of Macedonia 0.2271429608 -0.225518600
## Haiti 0.8528400232 -0.487249059
## Sudan 0.6102373647 -0.583319974
## Mauritius -0.3224620116 0.555744456
## Tanzania (United Republic of) -0.4568728938 -0.686533066
## Slovenia 0.0215875237 0.200046218
## Egypt -0.3163885313 0.410135541
## Morocco 0.0385729540 -0.140453926
## Zambia -1.2011155278 0.645660843
## Papua New Guinea 0.0929917875 -0.590106926
## Tunisia 0.3893479627 0.555015084
## Gambia -0.3474528468 -0.395655997
## Cyprus 0.1192318589 -0.540291494
## Bosnia and Herzegovina -0.0270242943 -0.181398112
## Israel -0.1633957692 -0.182073253
## Bahrain 0.3793530928 0.116840333
## Greece -0.4677522487 0.508801969
## New Zealand -0.7465178815 0.702424299
## Cameroon 0.0737457183 0.614653903
## Malawi -1.1968378168 -0.088176314
## Nepal -0.5593956686 -0.297819695
## Burundi 1.8276366746 0.677851002
## Iceland -0.3893837420 0.527074951
## Senegal 0.1229101109 -0.989313955
## Malta -0.1902293297 -0.234558008
## Japan -0.1092196560 -0.295650382
## Finland 0.0500578223 0.125708116
## Cambodia 0.0893450889 -0.641196445
## Mali -1.1754044432 -0.019086504
## Spain -0.2384440630 0.193672897
## United Kingdom -0.3405838816 -0.083078182
## Ireland -0.3678977300 0.633980162
## Benin -0.3370880163 0.133803308
## Ethiopia 0.2799208080 -0.661627350
## Canada -0.0899079143 -0.292960474
## Italy -0.0518444082 -0.045035198
## Iraq -0.0205434516 -0.377293351
## France -0.0424944013 -0.194411707
## Turkey -0.3388083756 0.259713548
## Austria 0.2005291571 -0.252609069
## India 0.2976255773 0.295702180
## Pakistan 1.0547918604 -0.663717157
## Togo -0.3551166163 0.469693499
## Germany 0.1990240456 -0.001114540
## Australia -0.6199683954 0.917438777
## Sweden 0.2506794343 -0.372812563
## Belgium 0.1848918584 -0.041147418
## Denmark -0.0930002006 0.544652022
## Korea (Republic of) -0.2412194847 0.254405229
## Mauritania 0.3835135567 -0.206485725
## United Arab Emirates 0.4626591783 -0.456221585
## Oman 0.2559958563 -0.052902782
## Netherlands -0.0840485953 0.299683270
## Côte d'Ivoire 0.0332611178 0.771694885
## Burkina Faso -0.3017379532 -0.688616909
## Liberia -0.4465735263 0.044506003
## Congo (Democratic Republic of the) -0.6733926462 0.284824245
## Mozambique -0.4315443547 -0.191228703
## Yemen 0.3877720692 -0.105861134
## Niger -1.4900241152 -0.772778138
## United States -0.3451427925 -0.002143577
## Central African Republic 0.9515520853 0.428317282
## Sierra Leone 0.9362103161 1.063217796
## Switzerland 0.1865273520 -0.420485693
## Luxembourg 0.4269134315 -0.843634606
## Afghanistan 0.1650292048 0.419870531
## Norway 0.1057068050 -0.021435945
## Saudi Arabia 0.2995446875 0.814349653
## Chad -0.1384298783 0.505501713
## Singapore 0.2087344376 -0.631938288
## Kuwait 0.4433858104 -0.154579254
## Qatar 0.7794028704 -0.869044046
## PC8
## Myanmar 0.464249830
## Gabon -0.016809690
## Guyana 0.283691259
## Libya 0.150608995
## Tajikistan -0.216743506
## Lesotho -0.302736425
## Honduras 0.264847993
## Nicaragua 0.168834839
## Jamaica 0.212157687
## Samoa 0.295420249
## Moldova (Republic of) -0.213918903
## Kyrgyzstan -0.042950292
## Philippines -0.097634576
## Belize -0.319139671
## Armenia 0.092802384
## Bhutan -0.074744150
## Rwanda 0.123998389
## Dominican Republic -0.001090328
## Namibia -0.508079106
## Georgia 0.042053001
## Tonga 0.306781672
## Ecuador 0.136462760
## Mongolia -0.229052691
## Venezuela (Bolivarian Republic of) 0.135464359
## Fiji -0.192307926
## Colombia 0.040791381
## Ukraine -0.298001037
## Guatemala 0.082547026
## Barbados 0.027150536
## Cuba 0.481755252
## Brazil -0.015904503
## South Africa -1.053520373
## Costa Rica 0.303047765
## Panama 0.311755507
## Viet Nam 0.087508486
## Albania 0.295251360
## Uruguay 0.027424769
## Bangladesh 0.135731141
## Azerbaijan -0.410686138
## Sri Lanka 0.190906050
## Bulgaria -0.221248901
## El Salvador -0.142869401
## Bahamas 0.017655734
## Paraguay 0.035958347
## Swaziland -1.057558817
## Bolivia (Plurinational State of) -0.183113873
## Congo -0.017934893
## Belarus -0.466662216
## Kenya 0.026234807
## Botswana -0.413437307
## Thailand -0.146109071
## Peru 0.026102641
## Zimbabwe 0.016998870
## Mexico 0.070725796
## Kazakhstan -0.575145605
## Uganda -0.572916878
## Suriname 0.088444159
## Latvia -0.169022123
## Montenegro -0.024931977
## Romania -0.075240568
## Jordan 0.176778023
## Syrian Arab Republic -0.199383676
## Russian Federation -0.481743543
## Lebanon 0.488700890
## China 0.025315956
## Maldives 0.197975099
## Hungary -0.059973220
## Chile 0.323884955
## Serbia -0.152580467
## Iran (Islamic Republic of) 0.148533006
## Indonesia 0.017986999
## Croatia 0.040691396
## Estonia -0.047421539
## Slovakia -0.049162831
## Argentina 0.001276173
## Trinidad and Tobago -0.252645792
## Portugal 0.229393237
## Ghana -0.037526847
## Lithuania -0.330632189
## Algeria 0.366520434
## Czech Republic 0.125812767
## Poland -0.006129686
## Malaysia 0.025054670
## The former Yugoslav Republic of Macedonia -0.140384418
## Haiti 0.202062572
## Sudan 0.195483201
## Mauritius 0.021314559
## Tanzania (United Republic of) 0.031354925
## Slovenia 0.201136871
## Egypt -0.229456272
## Morocco 0.283122879
## Zambia -0.794475565
## Papua New Guinea -0.568359399
## Tunisia 0.085563570
## Gambia -0.154919425
## Cyprus 0.238419359
## Bosnia and Herzegovina -0.037480926
## Israel 0.298293132
## Bahrain 0.116941358
## Greece 0.204463481
## New Zealand 0.134924597
## Cameroon 0.012924475
## Malawi 0.078962173
## Nepal -0.312497632
## Burundi 0.872691209
## Iceland 0.162465364
## Senegal -0.050182834
## Malta 0.261913162
## Japan 0.466519718
## Finland 0.056001135
## Cambodia -0.343470935
## Mali -0.051504428
## Spain 0.220613119
## United Kingdom 0.075500247
## Ireland 0.142599173
## Benin -0.550244820
## Ethiopia 0.123086514
## Canada 0.172678430
## Italy 0.341553582
## Iraq -0.466207100
## France 0.236234797
## Turkey -0.161342462
## Austria 0.146278110
## India -0.049179210
## Pakistan -0.193259634
## Togo -0.266806482
## Germany 0.085755248
## Australia 0.122261331
## Sweden 0.117156987
## Belgium 0.045294793
## Denmark -0.054496911
## Korea (Republic of) 0.313970540
## Mauritania -0.121129103
## United Arab Emirates -0.094267148
## Oman 0.014389692
## Netherlands 0.062106304
## Côte d'Ivoire 0.135308345
## Burkina Faso -0.537702988
## Liberia 0.479166548
## Congo (Democratic Republic of the) 0.535636325
## Mozambique -0.793719582
## Yemen -0.164016287
## Niger 0.358842606
## United States -0.117040628
## Central African Republic 0.645440709
## Sierra Leone 1.384425435
## Switzerland 0.149931235
## Luxembourg 0.088632850
## Afghanistan -0.184925327
## Norway -0.123820241
## Saudi Arabia -0.334517472
## Chad 0.720298478
## Singapore -0.058473974
## Kuwait -0.577682871
## Qatar -0.544730977
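Printing all 155 rows just to see the ends of the ranking is a bit heavy; dplyr’s `slice_max()`/`slice_min()` would show only the extremes. A sketch, using the score data frame defined above:

```r
# Only the ten most extreme countries at each end of PC4
human_scaled_pca %>% slice_max(PC4, n = 10)
human_scaled_pca %>% slice_min(PC4, n = 10)
```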
human_scaled_pca %>% arrange(PC4)
## PC1 PC2 PC3
## Qatar -3.41275720 -1.651554785 -2.092935891
## Kuwait -2.34123467 -1.651753979 -1.483382891
## Singapore -2.86994888 0.516674990 -0.429116560
## Chad 6.07170782 0.419662712 -0.379782202
## Saudi Arabia -2.04814861 -1.613019142 0.881908713
## Norway -3.08378347 1.890281525 0.056969733
## Afghanistan 3.17249375 -1.530064615 2.567425390
## Luxembourg -2.34430026 0.673760445 -0.178640850
## Switzerland -2.69683318 0.896899989 -0.275048181
## Sierra Leone 5.29969299 0.894599149 -1.383380245
## Central African Republic 5.15979806 0.299745466 -0.827624027
## United States -2.15666301 0.461289598 -0.876464679
## Niger 5.39088099 -1.241241283 0.873413438
## Yemen 2.36569153 -2.652557169 0.382761370
## Mozambique 4.24412061 2.357683071 0.761360284
## Congo (Democratic Republic of the) 4.43880394 0.539558045 -1.303548841
## Liberia 3.96747850 0.317838331 -0.863492996
## Burkina Faso 3.94333744 0.053955410 -0.319270357
## Côte d'Ivoire 4.64084255 -0.598211941 -0.459138696
## Netherlands -2.72716147 1.469105166 0.308130817
## Oman -1.29017323 -2.054918869 0.293468342
## United Arab Emirates -2.14191598 -0.901306523 -0.292906252
## Mauritania 2.60820181 -1.330052714 1.623557480
## Korea (Republic of) -2.10167572 -0.277374746 -0.352782824
## Denmark -2.78120158 1.773429808 0.171302280
## Belgium -2.35585306 1.588259556 0.847919087
## Sweden -2.51286376 1.963931278 0.595735063
## Australia -2.99370191 1.188201602 -0.106765315
## Germany -2.51325622 1.320373244 0.388898222
## Togo 2.92936876 1.008638618 -0.699996373
## Pakistan 1.72263666 -1.938459968 1.749897520
## India 1.10413681 -1.992805576 1.003863214
## Austria -2.38403803 0.902207024 -0.003026204
## Turkey -0.57530772 -1.527893514 0.722212692
## France -2.19885491 0.696790265 -0.217303004
## Iraq 0.96258010 -1.699941581 2.252525442
## Italy -2.18260449 0.338893545 0.646989575
## Canada -2.26302904 1.033411190 -0.292957486
## Ethiopia 2.92275146 0.800047424 0.272497038
## Benin 2.85453769 -0.106611646 -0.837692019
## Ireland -2.62808538 0.362020337 -0.640755797
## United Kingdom -2.05713992 0.589584266 -0.407850345
## Spain -2.33209334 1.348636610 0.700913469
## Mali 4.42864302 -0.725923093 -0.189211918
## Cambodia 1.49861802 0.434332761 -0.135571575
## Finland -2.48617797 1.892468888 0.594038932
## Japan -2.20169826 -0.609972749 -0.751252991
## Malta -1.44928760 -1.028809545 -0.081348924
## Senegal 2.60628829 1.250222087 1.794207914
## Iceland -2.74910896 2.053637043 0.419653402
## Burundi 2.92716442 2.104937966 -0.100589338
## Nepal 1.43508146 1.194786495 0.463613461
## Malawi 3.49174202 1.207150760 -0.936231767
## Cameroon 3.50119915 1.065163090 0.063364397
## New Zealand -2.49492486 1.269102839 0.014971514
## Greece -1.99151909 0.052754647 -0.028068250
## Bahrain -1.83627456 -1.277276062 -0.005796514
## Israel -2.08527706 0.550448154 -0.402495966
## Bosnia and Herzegovina -0.49234291 -0.706061064 0.646486710
## Cyprus -1.52126616 -0.301457929 -0.810134452
## Gambia 3.41276063 -0.007016863 -0.901829512
## Tunisia -0.84854176 -0.830766824 1.960056511
## Papua New Guinea 2.24414173 -0.288614274 -1.445469022
## Zambia 2.37579379 0.275956183 -0.731488256
## Morocco 0.36411731 -2.047307836 0.794721315
## Egypt 0.07863364 -2.585345419 0.217108451
## Slovenia -2.16450489 0.836704528 -0.077310885
## Tanzania (United Republic of) 2.88648521 1.899305596 0.487200420
## Mauritius -0.79068829 -0.937216376 -0.221835304
## Sudan 2.60420748 -1.045381397 1.334927961
## Haiti 2.36064900 -0.561384801 -1.290302990
## The former Yugoslav Republic of Macedonia -0.62281995 0.263132896 1.292063415
## Malaysia -0.92235926 -0.984616060 -0.085310083
## Poland -1.51988207 0.193167920 -0.111253539
## Czech Republic -1.94024133 0.041663314 -0.421174852
## Algeria -0.88489346 -1.630270510 1.842073337
## Lithuania -1.47424792 0.607825365 -0.365058067
## Ghana 2.01345921 0.355485039 -1.241712687
## Portugal -1.99792574 1.040595945 0.180046605
## Trinidad and Tobago -0.43657185 0.131125955 0.074004863
## Argentina -1.45601757 0.838984586 1.037680344
## Slovakia -1.48297697 -0.019596892 -0.401935798
## Estonia -1.68580633 0.371211438 -0.569235608
## Croatia -1.30589585 0.406709990 0.159833574
## Indonesia 0.48036817 -0.567596783 0.116329285
## Iran (Islamic Republic of) -0.96436342 -2.793788341 0.315434603
## Serbia -0.83377077 0.705073607 0.951750046
## Chile -1.22310783 -0.447088566 -0.163992357
## Hungary -1.36809208 -0.493681621 -0.942204385
## Maldives -0.67904281 -0.989578003 -0.802207742
## China -0.67672686 0.359953468 0.049521929
## Lebanon -1.23268082 -2.526681094 0.043940503
## Russian Federation -0.80811646 -0.054778581 -0.829752646
## Syrian Arab Republic 0.36377295 -2.559738024 1.335263924
## Jordan -0.57591739 -2.362089025 0.910026604
## Romania -0.77188498 -0.403249223 -0.702527677
## Montenegro -1.11686706 -0.122488486 -0.280938313
## Latvia -1.31130851 0.187341613 -0.642976519
## Suriname -0.12432505 -0.997314985 -0.287838649
## Uganda 2.95628470 1.805866347 0.326356280
## Kazakhstan -0.75696807 0.549022028 -0.685887511
## Mexico -0.53292222 0.345557424 1.529477321
## Zimbabwe 2.24835708 1.728001429 0.178907487
## Peru -0.08991745 0.379961425 -0.057040060
## Thailand -0.32856107 -0.610848412 -1.075159434
## Botswana 0.51070846 -0.005038506 -1.402977382
## Kenya 2.23833543 0.681110048 -0.446481460
## Belarus -1.09207196 0.821722632 0.249889580
## Congo 2.50890521 0.529214816 -1.324187811
## Bolivia (Plurinational State of) 0.61863947 2.118079004 1.809884477
## Swaziland 2.33315345 -0.596949105 -0.306431397
## Paraguay 0.44428803 -0.500263835 0.076871476
## Bahamas -0.80193053 0.256043765 -0.862499787
## El Salvador 0.31502973 -0.082434305 0.932630500
## Bulgaria -0.83448409 0.297191795 -0.354609758
## Sri Lanka -0.75648305 -1.887423148 -0.154346306
## Azerbaijan -0.13345597 0.270223921 -0.917559403
## Bangladesh 1.17163475 -0.264831864 0.284988061
## Uruguay -1.10762421 -0.407424570 -0.791498107
## Albania -0.67704042 -0.334378287 0.233899467
## Viet Nam -0.12772727 0.635409753 -0.023983110
## Panama -0.52104983 -0.445966480 0.109079216
## Costa Rica -0.88088858 0.247783224 1.177525013
## South Africa 0.52541319 1.290788972 0.929186725
## Brazil -0.52713454 -0.437211056 -0.914408365
## Cuba -0.85690790 1.206502306 2.170553442
## Barbados -0.83161287 0.528012907 -0.591747431
## Guatemala 0.92978002 -1.021873670 0.076300346
## Ukraine -0.61256414 -0.236511138 -0.828004404
## Colombia -0.24456731 -0.012229219 -0.017543050
## Fiji -0.54064402 -1.018100867 0.043783998
## Venezuela (Bolivarian Republic of) -0.35614736 -0.328185500 -0.262447337
## Mongolia -0.58002389 0.031790917 -0.819700835
## Ecuador -0.48514379 1.062937942 1.432626667
## Tonga -0.45134514 -1.165949078 -1.371247180
## Georgia -0.38211425 -0.445864707 -0.649432206
## Namibia 0.58896079 1.429030129 0.561863707
## Dominican Republic 0.15665142 -0.232069047 -0.005138172
## Rwanda 1.08474351 3.125519996 1.399402307
## Bhutan 0.21621916 -0.205695481 -1.274792504
## Armenia -0.41497907 -0.613047804 -0.668870242
## Belize 0.03821611 -0.834020935 -0.164341793
## Philippines 0.25460237 0.009222189 0.528877460
## Kyrgyzstan -0.09841728 -0.010254730 0.233419334
## Moldova (Republic of) -0.14463141 0.316328188 -0.310093407
## Samoa -0.49340583 -2.067158649 -0.048787759
## Jamaica -0.06175153 0.015692724 -0.477625887
## Nicaragua 0.33842550 0.570944859 1.600126090
## Honduras 0.37144496 -0.477332449 0.893880234
## Lesotho 2.41352165 0.966326284 -0.469152353
## Tajikistan 0.21994425 -0.299649920 -0.471229436
## Libya -1.53980422 -1.493748457 0.207781921
## Guyana 0.81614017 -0.005126477 0.855888106
## Gabon 0.38951619 0.539298671 -1.477157917
## Myanmar 0.22129643 -0.339858349 -2.137967976
## PC4 PC5
## Qatar -3.434350558 -2.744086530
## Kuwait -2.268698795 -1.459482728
## Singapore -2.145302064 -0.773851494
## Chad -1.710289949 -0.313209384
## Saudi Arabia -1.483999840 -0.771322474
## Norway -1.429085734 -0.634591949
## Afghanistan -1.307267500 0.171076164
## Luxembourg -1.167750947 -0.659139954
## Switzerland -1.166378061 -0.293310010
## Sierra Leone -1.136109156 -0.260135837
## Central African Republic -1.039430441 -0.027250931
## United States -0.972954474 -0.523731016
## Niger -0.965501357 -1.303263726
## Yemen -0.949498266 1.002500642
## Mozambique -0.926674225 0.313358992
## Congo (Democratic Republic of the) -0.866980658 0.149837925
## Liberia -0.821010099 0.337292077
## Burkina Faso -0.808281584 0.585365213
## Côte d'Ivoire -0.803721453 -0.656805429
## Netherlands -0.797993542 -0.024473662
## Oman -0.795461293 -0.057279210
## United Arab Emirates -0.773527981 -1.767659249
## Mauritania -0.732229396 0.149028861
## Korea (Republic of) -0.691151610 0.615032289
## Denmark -0.683471922 0.056081866
## Belgium -0.676265477 -0.050602376
## Sweden -0.672261525 -0.206796715
## Australia -0.667575221 0.146603167
## Germany -0.648265885 -0.140054500
## Togo -0.644239048 0.928867228
## Pakistan -0.633197073 0.676731900
## India -0.628896340 0.852767619
## Austria -0.615634290 -0.107843670
## Turkey -0.590951238 0.618018515
## France -0.570782466 0.293158744
## Iraq -0.565370695 -0.287911201
## Italy -0.553844809 0.335992609
## Canada -0.536756869 -0.054260003
## Ethiopia -0.522758647 0.638178331
## Benin -0.510128117 0.862388776
## Ireland -0.474385756 0.158551905
## United Kingdom -0.430565377 -0.121151666
## Spain -0.408851517 0.350774791
## Mali -0.407128283 -1.027734140
## Cambodia -0.403276572 1.545495585
## Finland -0.402199400 -0.035148222
## Japan -0.370785436 0.142433142
## Malta -0.354324825 0.284860129
## Senegal -0.333524119 0.042331771
## Iceland -0.327005382 0.264383077
## Burundi -0.323059706 0.419879197
## Nepal -0.316593826 1.114794397
## Malawi -0.308982421 0.054072818
## Cameroon -0.260759929 -0.555844104
## New Zealand -0.241859264 0.196214374
## Greece -0.224756734 0.689741659
## Bahrain -0.214807455 -0.784249226
## Israel -0.191614837 0.468282001
## Bosnia and Herzegovina -0.182925626 1.209146178
## Cyprus -0.181589001 0.539000803
## Gambia -0.172392139 0.044491812
## Tunisia -0.167303534 0.654406033
## Papua New Guinea -0.125337620 1.185766443
## Zambia -0.098419324 0.125849595
## Morocco -0.091930219 0.468826463
## Egypt -0.084109671 0.384246559
## Slovenia -0.070286122 0.519533942
## Tanzania (United Republic of) -0.057669838 -0.228794202
## Mauritius -0.053895731 0.417082038
## Sudan -0.030656873 -0.712459363
## Haiti -0.018017361 0.819230616
## The former Yugoslav Republic of Macedonia -0.011789161 0.757393781
## Malaysia -0.007775677 0.250309695
## Poland 0.028293784 0.478839786
## Czech Republic 0.034533898 0.396983608
## Algeria 0.036027293 -0.035468371
## Lithuania 0.038625393 0.469887414
## Ghana 0.053762696 0.570568434
## Portugal 0.070353716 0.343488087
## Trinidad and Tobago 0.094363327 -0.478807410
## Argentina 0.100541901 -0.382426961
## Slovakia 0.108807736 0.184357318
## Estonia 0.125625669 0.348079365
## Croatia 0.137752539 0.539931997
## Indonesia 0.140336726 0.123615819
## Iran (Islamic Republic of) 0.145504912 -0.038590210
## Serbia 0.150877958 0.741356169
## Chile 0.151604505 -0.029232073
## Hungary 0.190542428 0.422324617
## Maldives 0.204993662 1.087492380
## China 0.207098455 0.948090559
## Lebanon 0.219929309 0.207450407
## Russian Federation 0.230843800 0.198560713
## Syrian Arab Republic 0.236685678 0.186780574
## Jordan 0.239258300 -0.030840898
## Romania 0.270762447 0.358282124
## Montenegro 0.271709792 0.787994084
## Latvia 0.278702422 0.357077524
## Suriname 0.298543025 -0.101737220
## Uganda 0.305231215 -0.547035571
## Kazakhstan 0.308198625 0.224421990
## Mexico 0.313139790 -0.529001350
## Zimbabwe 0.329045089 -0.135099353
## Peru 0.346459243 0.357781868
## Thailand 0.347688665 0.621808637
## Botswana 0.355643457 0.015774383
## Kenya 0.390264593 -0.349458967
## Belarus 0.394723997 0.361435870
## Congo 0.398128908 -0.669796656
## Bolivia (Plurinational State of) 0.399174576 -0.250247599
## Swaziland 0.453557485 -0.664998668
## Paraguay 0.471807391 0.004514533
## Bahamas 0.488197758 0.057513980
## El Salvador 0.498582285 -0.209535347
## Bulgaria 0.529608478 0.276207632
## Sri Lanka 0.538663748 0.378862799
## Azerbaijan 0.560060281 0.186188142
## Bangladesh 0.581364674 -0.152520365
## Uruguay 0.599660130 -0.231245152
## Albania 0.600494707 0.497187324
## Viet Nam 0.603302121 0.872257041
## Panama 0.606356209 -0.882843754
## Costa Rica 0.612051066 -0.448031363
## South Africa 0.625719759 -0.626998851
## Brazil 0.635546781 -0.242856529
## Cuba 0.652802978 -0.141187770
## Barbados 0.716841534 0.216873945
## Guatemala 0.732023267 -0.737883028
## Ukraine 0.735815695 0.683855615
## Colombia 0.739148454 -0.404110156
## Fiji 0.751446198 0.004254590
## Venezuela (Bolivarian Republic of) 0.758276118 -0.904889663
## Mongolia 0.758449744 0.440072188
## Ecuador 0.768903920 -0.699510243
## Tonga 0.781942289 0.749978939
## Georgia 0.800279490 0.377801426
## Namibia 0.814718726 -0.404258828
## Dominican Republic 0.820699804 -0.898427787
## Rwanda 0.823466529 -0.094488185
## Bhutan 0.843768126 0.344669908
## Armenia 0.845581564 0.440952425
## Belize 0.905689903 -0.354976024
## Philippines 0.943707745 -0.508529110
## Kyrgyzstan 0.984423355 0.253963473
## Moldova (Republic of) 0.994627115 0.470592772
## Samoa 1.016812799 -0.110042154
## Jamaica 1.044404927 -0.227410096
## Nicaragua 1.100369752 -1.082146297
## Honduras 1.206509214 -1.014293478
## Lesotho 1.226549015 -1.630682971
## Tajikistan 1.280122465 0.062883028
## Libya 1.311190236 -0.777481767
## Guyana 1.490657114 -1.914978588
## Gabon 1.778909476 -2.181779590
## Myanmar 2.275737391 -0.784101836
## PC6 PC7
## Qatar 0.7794028704 -0.869044046
## Kuwait 0.4433858104 -0.154579254
## Singapore 0.2087344376 -0.631938288
## Chad -0.1384298783 0.505501713
## Saudi Arabia 0.2995446875 0.814349653
## Norway 0.1057068050 -0.021435945
## Afghanistan 0.1650292048 0.419870531
## Luxembourg 0.4269134315 -0.843634606
## Switzerland 0.1865273520 -0.420485693
## Sierra Leone 0.9362103161 1.063217796
## Central African Republic 0.9515520853 0.428317282
## United States -0.3451427925 -0.002143577
## Niger -1.4900241152 -0.772778138
## Yemen 0.3877720692 -0.105861134
## Mozambique -0.4315443547 -0.191228703
## Congo (Democratic Republic of the) -0.6733926462 0.284824245
## Liberia -0.4465735263 0.044506003
## Burkina Faso -0.3017379532 -0.688616909
## Côte d'Ivoire 0.0332611178 0.771694885
## Netherlands -0.0840485953 0.299683270
## Oman 0.2559958563 -0.052902782
## United Arab Emirates 0.4626591783 -0.456221585
## Mauritania 0.3835135567 -0.206485725
## Korea (Republic of) -0.2412194847 0.254405229
## Denmark -0.0930002006 0.544652022
## Belgium 0.1848918584 -0.041147418
## Sweden 0.2506794343 -0.372812563
## Australia -0.6199683954 0.917438777
## Germany 0.1990240456 -0.001114540
## Togo -0.3551166163 0.469693499
## Pakistan 1.0547918604 -0.663717157
## India 0.2976255773 0.295702180
## Austria 0.2005291571 -0.252609069
## Turkey -0.3388083756 0.259713548
## France -0.0424944013 -0.194411707
## Iraq -0.0205434516 -0.377293351
## Italy -0.0518444082 -0.045035198
## Canada -0.0899079143 -0.292960474
## Ethiopia 0.2799208080 -0.661627350
## Benin -0.3370880163 0.133803308
## Ireland -0.3678977300 0.633980162
## United Kingdom -0.3405838816 -0.083078182
## Spain -0.2384440630 0.193672897
## Mali -1.1754044432 -0.019086504
## Cambodia 0.0893450889 -0.641196445
## Finland 0.0500578223 0.125708116
## Japan -0.1092196560 -0.295650382
## Malta -0.1902293297 -0.234558008
## Senegal 0.1229101109 -0.989313955
## Iceland -0.3893837420 0.527074951
## Burundi 1.8276366746 0.677851002
## Nepal -0.5593956686 -0.297819695
## Malawi -1.1968378168 -0.088176314
## Cameroon 0.0737457183 0.614653903
## New Zealand -0.7465178815 0.702424299
## Greece -0.4677522487 0.508801969
## Bahrain 0.3793530928 0.116840333
## Israel -0.1633957692 -0.182073253
## Bosnia and Herzegovina -0.0270242943 -0.181398112
## Cyprus 0.1192318589 -0.540291494
## Gambia -0.3474528468 -0.395655997
## Tunisia 0.3893479627 0.555015084
## Papua New Guinea 0.0929917875 -0.590106926
## Zambia -1.2011155278 0.645660843
## Morocco 0.0385729540 -0.140453926
## Egypt -0.3163885313 0.410135541
## Slovenia 0.0215875237 0.200046218
## Tanzania (United Republic of) -0.4568728938 -0.686533066
## Mauritius -0.3224620116 0.555744456
## Sudan 0.6102373647 -0.583319974
## Haiti 0.8528400232 -0.487249059
## The former Yugoslav Republic of Macedonia 0.2271429608 -0.225518600
## Malaysia 0.5943478473 -0.310882005
## Poland 0.0146512625 0.078155353
## Czech Republic 0.0054545873 0.274517961
## Algeria 0.4902133796 0.630600521
## Lithuania 0.1583897022 0.463060174
## Ghana 0.3721749488 0.232143202
## Portugal -0.1179107107 0.036420328
## Trinidad and Tobago 0.5483599732 -0.287201887
## Argentina -0.7562700661 0.998335318
## Slovakia 0.0857885836 0.030612783
## Estonia -0.1433648078 0.319266420
## Croatia 0.1183552755 -0.081785909
## Indonesia 0.0697247267 0.318509670
## Iran (Islamic Republic of) -0.3591669622 0.722852945
## Serbia 0.1960956378 0.001140764
## Chile -0.9916455370 -0.141135886
## Hungary 0.0530148587 0.202738900
## Maldives 0.2331259089 -0.434462712
## China 0.3732800063 -0.461876446
## Lebanon -0.0082757639 0.066977961
## Russian Federation 0.1868210793 0.214125170
## Syrian Arab Republic 0.0533687969 0.381438615
## Jordan 0.0960728327 0.426350376
## Romania -0.1404008807 -0.081718774
## Montenegro -0.0735326488 0.115347703
## Latvia 0.1885997063 0.131359913
## Suriname 0.2671143004 0.054969268
## Uganda -0.2663511002 -0.222926610
## Kazakhstan 0.1478071951 0.258894510
## Mexico -0.3750194353 -0.284486795
## Zimbabwe 1.0326915615 0.484252449
## Peru -0.2954092927 -0.346454974
## Thailand -0.3728835623 -0.321836338
## Botswana 0.4208028626 0.051944196
## Kenya 0.0003908410 0.192825515
## Belarus 0.2002805861 0.442680455
## Congo -0.7300608217 0.035047723
## Bolivia (Plurinational State of) -0.0005896796 0.218008164
## Swaziland 0.9379950756 1.083913498
## Paraguay -0.3868646873 -0.383448677
## Bahamas 0.2221208500 -0.684132379
## El Salvador -0.5208362283 -0.298642007
## Bulgaria -0.1548721090 -0.094755523
## Sri Lanka 0.1134797992 0.181878318
## Azerbaijan 0.2228914450 -0.691575682
## Bangladesh -0.2801004193 -0.761911579
## Uruguay -0.8520051813 0.123425781
## Albania 0.3719985522 -0.758087155
## Viet Nam 0.1370254776 -0.819669561
## Panama -0.7794109136 -0.261632639
## Costa Rica -0.5801338772 -0.205490683
## South Africa 0.8641957710 0.880893184
## Brazil -0.9387997247 0.259841102
## Cuba -0.1046529231 -0.129211242
## Barbados -0.5554976971 0.143179329
## Guatemala -0.6663882061 -0.519546181
## Ukraine -0.0402324586 0.377186088
## Colombia -0.5193595878 -0.089128360
## Fiji -0.2586706159 0.954688210
## Venezuela (Bolivarian Republic of) -0.8202034504 0.165151944
## Mongolia 0.3664312347 0.382144832
## Ecuador -0.6229513015 0.060219694
## Tonga 0.0223647898 0.463117389
## Georgia -0.4329675048 -0.129134324
## Namibia 0.6531061286 -0.279270660
## Dominican Republic -0.9977964148 -0.136457443
## Rwanda 1.4882291993 -0.291101527
## Bhutan 0.1354618087 -0.160322615
## Armenia 0.1512678759 -0.513028090
## Belize -0.5257968357 0.220621752
## Philippines 0.5735628979 -0.188977457
## Kyrgyzstan 0.4470455299 -0.082545945
## Moldova (Republic of) 0.3856501921 -0.575902862
## Samoa 0.1862354772 0.197408492
## Jamaica -0.5596006430 -0.552066779
## Nicaragua -0.6811130040 -0.518838884
## Honduras -0.3160976543 -0.377880488
## Lesotho 1.0758622144 1.138717611
## Tajikistan 0.3475875471 -0.503606390
## Libya 0.9330807855 0.570157750
## Guyana 0.3639403210 -0.027849786
## Gabon -0.1202364303 0.236755731
## Myanmar 1.7676007033 -0.806312423
## PC8
## Qatar -0.544730977
## Kuwait -0.577682871
## Singapore -0.058473974
## Chad 0.720298478
## Saudi Arabia -0.334517472
## Norway -0.123820241
## Afghanistan -0.184925327
## Luxembourg 0.088632850
## Switzerland 0.149931235
## Sierra Leone 1.384425435
## Central African Republic 0.645440709
## United States -0.117040628
## Niger 0.358842606
## Yemen -0.164016287
## Mozambique -0.793719582
## Congo (Democratic Republic of the) 0.535636325
## Liberia 0.479166548
## Burkina Faso -0.537702988
## Côte d'Ivoire 0.135308345
## Netherlands 0.062106304
## Oman 0.014389692
## United Arab Emirates -0.094267148
## Mauritania -0.121129103
## Korea (Republic of) 0.313970540
## Denmark -0.054496911
## Belgium 0.045294793
## Sweden 0.117156987
## Australia 0.122261331
## Germany 0.085755248
## Togo -0.266806482
## Pakistan -0.193259634
## India -0.049179210
## Austria 0.146278110
## Turkey -0.161342462
## France 0.236234797
## Iraq -0.466207100
## Italy 0.341553582
## Canada 0.172678430
## Ethiopia 0.123086514
## Benin -0.550244820
## Ireland 0.142599173
## United Kingdom 0.075500247
## Spain 0.220613119
## Mali -0.051504428
## Cambodia -0.343470935
## Finland 0.056001135
## Japan 0.466519718
## Malta 0.261913162
## Senegal -0.050182834
## Iceland 0.162465364
## Burundi 0.872691209
## Nepal -0.312497632
## Malawi 0.078962173
## Cameroon 0.012924475
## New Zealand 0.134924597
## Greece 0.204463481
## Bahrain 0.116941358
## Israel 0.298293132
## Bosnia and Herzegovina -0.037480926
## Cyprus 0.238419359
## Gambia -0.154919425
## Tunisia 0.085563570
## Papua New Guinea -0.568359399
## Zambia -0.794475565
## Morocco 0.283122879
## Egypt -0.229456272
## Slovenia 0.201136871
## Tanzania (United Republic of) 0.031354925
## Mauritius 0.021314559
## Sudan 0.195483201
## Haiti 0.202062572
## The former Yugoslav Republic of Macedonia -0.140384418
## Malaysia 0.025054670
## Poland -0.006129686
## Czech Republic 0.125812767
## Algeria 0.366520434
## Lithuania -0.330632189
## Ghana -0.037526847
## Portugal 0.229393237
## Trinidad and Tobago -0.252645792
## Argentina 0.001276173
## Slovakia -0.049162831
## Estonia -0.047421539
## Croatia 0.040691396
## Indonesia 0.017986999
## Iran (Islamic Republic of) 0.148533006
## Serbia -0.152580467
## Chile 0.323884955
## Hungary -0.059973220
## Maldives 0.197975099
## China 0.025315956
## Lebanon 0.488700890
## Russian Federation -0.481743543
## Syrian Arab Republic -0.199383676
## Jordan 0.176778023
## Romania -0.075240568
## Montenegro -0.024931977
## Latvia -0.169022123
## Suriname 0.088444159
## Uganda -0.572916878
## Kazakhstan -0.575145605
## Mexico 0.070725796
## Zimbabwe 0.016998870
## Peru 0.026102641
## Thailand -0.146109071
## Botswana -0.413437307
## Kenya 0.026234807
## Belarus -0.466662216
## Congo -0.017934893
## Bolivia (Plurinational State of) -0.183113873
## Swaziland -1.057558817
## Paraguay 0.035958347
## Bahamas 0.017655734
## El Salvador -0.142869401
## Bulgaria -0.221248901
## Sri Lanka 0.190906050
## Azerbaijan -0.410686138
## Bangladesh 0.135731141
## Uruguay 0.027424769
## Albania 0.295251360
## Viet Nam 0.087508486
## Panama 0.311755507
## Costa Rica 0.303047765
## South Africa -1.053520373
## Brazil -0.015904503
## Cuba 0.481755252
## Barbados 0.027150536
## Guatemala 0.082547026
## Ukraine -0.298001037
## Colombia 0.040791381
## Fiji -0.192307926
## Venezuela (Bolivarian Republic of) 0.135464359
## Mongolia -0.229052691
## Ecuador 0.136462760
## Tonga 0.306781672
## Georgia 0.042053001
## Namibia -0.508079106
## Dominican Republic -0.001090328
## Rwanda 0.123998389
## Bhutan -0.074744150
## Armenia 0.092802384
## Belize -0.319139671
## Philippines -0.097634576
## Kyrgyzstan -0.042950292
## Moldova (Republic of) -0.213918903
## Samoa 0.295420249
## Jamaica 0.212157687
## Nicaragua 0.168834839
## Honduras 0.264847993
## Lesotho -0.302736425
## Tajikistan -0.216743506
## Libya 0.150608995
## Guyana 0.283691259
## Gabon -0.016809690
## Myanmar 0.464249830
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
I already touched on this in the previous section, but since the first principal component aligns with Mat.Mor, Ado.Birth, Life.Exp, GNI, Edu.Exp, and Edu2.FM, it seems to capture the general wellbeing of society. Or rather, the lack thereof, since it points in the opposite direction.
The second axis measures gender equality, as it aligns well with
Parli.F and Labo.FM.
There are some additional interesting correlations between the original variables and the PCs. Mat.Mor and Edu.Exp both point slightly in the positive PC2 direction. This doesn't make a lot of intuitive sense if we consider PC2 to capture gender equality or liberal values. However, Mat.Mor points in the opposite direction from the social equality trend line, which may suggest that the principal components aren't aligned in any particularly meaningful direction. PCA has maximized the dispersion along PC1 and PC2, but that does not necessarily mean the axes are meaningful in themselves; they may be slightly misaligned compared to the labels I've given them.
This idea is supported by the fact that the PC1 distribution is bimodal, so the second, smaller mode might be pulling the PCA slightly off the trend line. That mode consists mostly of African countries and other developing nations. The fact that the other variables are distributed differently there might be due to artifacts of colonialism or some other large systemic difference between developed and developing nations.
ggplot(data.frame(pca_scaled$x), aes(x=PC1)) + geom_density()
fig. 3.2, PC1 distribution, clearly bimodal
Parli.F and Labo.FM point in slightly different directions from each other: Parli.F is slightly aligned with GNI, while Labo.FM is almost orthogonal to it. This suggests that female labor participation is uncorrelated with GNI (it doesn't matter for the economy what the sex of a worker is), while female parliamentary representation has a positive correlation with GNI (of course, we don't have a causal link here, only a correlation).
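These (non-)correlations can also be checked directly on the original variables rather than read off the biplot. A quick sketch, assuming the `human` data frame used for the PCA is still available in the session:

```r
# Pairwise correlations between the gender-equality variables and GNI.
# (Assumes `human` still holds the columns used for the PCA above.)
round(cor(human[, c("Labo.FM", "Parli.F", "GNI")]), 2)
```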
Just for fun, let's use the log-transformed data to see if some of those correlations change:
pca_log_scaled <- human.log %>% scale %>% as.data.frame %>% prcomp
summary(pca_log_scaled)
biplot(pca_log_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"))
fig. 4.1, biplot for PCA of the standardized and log transformed data
Sure enough, the summary shows that the first PC now captures 57% of the variance compared to 53.6% before, a better result. The first four PCs now capture 90% of the variance; after that we get diminishing returns.
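The variance figures quoted above can be recomputed from the PCA object itself: each PC's proportion of variance is its squared standard deviation over the total, and cumulating those gives the "first k PCs capture X%" numbers.

```r
# Cumulative proportion of variance explained by the PCs of the
# log-transformed, standardized data (pca_log_scaled from above).
prop_var <- pca_log_scaled$sdev^2 / sum(pca_log_scaled$sdev^2)
round(cumsum(prop_var), 3)
```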
pca_log_scaled
## Standard deviations (1, .., p=8):
## [1] 2.1505310 1.1259172 0.8681023 0.7727862 0.5558253 0.4410956 0.3827107
## [8] 0.3267292
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.32850815 0.01067147 -0.35894548 0.77560185 -0.36060516
## Labo.FM 0.03263713 0.72942195 -0.61074433 -0.23458714 0.05083415
## Life.Exp -0.42542035 -0.05622607 0.11141581 -0.03806539 0.26227538
## Edu.Exp -0.42035073 0.09697456 -0.06818573 -0.03839701 0.44933876
## GNI -0.43314660 -0.09480875 -0.01735880 0.06836804 0.13860792
## Mat.Mor 0.43672168 0.01162213 -0.02391875 0.21779671 -0.12263401
## Ado.Birth 0.38284906 0.03041796 -0.03340255 0.48515646 0.74256569
## Parli.F -0.09178670 0.66724454 0.69216873 0.23021937 -0.10502880
## PC6 PC7 PC8
## Edu2.FM 0.05763161 -0.12121944 -0.11626173
## Labo.FM 0.12840535 0.11338286 0.08313215
## Life.Exp 0.73375796 -0.22248569 0.38118853
## Edu.Exp -0.56661984 -0.53272731 0.03187199
## GNI -0.27117249 0.74980849 0.37876162
## Mat.Mor -0.17790440 -0.22051498 0.81597528
## Ado.Birth 0.11877107 0.16353812 -0.15412247
## Parli.F -0.03795809 0.03959254 -0.01489879
Looking closer at the coefficients, PC1, PC2, and PC3 have roughly the same interpretation as before. PC4 has changed now, but it is still as difficult to interpret as before.
The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), how they perceive the product (12 questions), and some personal details (4 questions).
Load the tea dataset and convert its character variables to factors:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
#View(tea)
The dataset consists of 300 rows of 36 answers to the questionnaire; each row represents one respondent.
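For the structure, `str()` gives a compact per-variable view; with 36 variables, limiting `list.len` keeps the output manageable:

```r
# Compact structure of the tea data; show only the first few variables.
str(tea, list.len = 5)
```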
Summary statistics follow.
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
The FactoMineR documentation (https://rdrr.io/cran/FactoMineR/man/tea.html) says the first 18 variables are about how they drink tea, the 19th is age, and the rest are personal questions and "product's perception", which I'm choosing to interpret as how they think about tea.
There are a lot of variables here, so let's choose a sensible subset of them to look at. I'm interpreting the following variables free-hand, since I can't find much metadata:

- Tea: which of three types of tea the respondent prefers (green, black, Earl Grey)
- price: the price level of the tea that the respondent prefers
- How: whether the respondent takes tea as is, with lemon, with milk, or in some other way
- sex: the sex of the respondent
- SPC: some kind of general social group for the respondent (student, employee (white collar?), middle (management?), senior)
- age_Q: the age group of the respondent
- frequency: how often the respondent drinks tea

Of these, Tea, price, and How are "active" variables, and the rest are "supplementary" variables.
As a sidenote, this dataset is very badly documented.
tea_filtered <- tea %>% dplyr::select(one_of("Tea", "price", "How", "sex", "SPC", "age_Q", "frequency"))
summary(tea_filtered)
## Tea price How sex SPC
## black : 74 p_branded : 95 alone:195 F:178 employee :59
## Earl Grey:193 p_cheap : 7 lemon: 33 M:122 middle :40
## green : 33 p_private label: 21 milk : 63 non-worker :64
## p_unknown : 12 other: 9 other worker:20
## p_upscale : 53 senior :35
## p_variable :112 student :70
## workman :12
## age_Q frequency
## +60 :38 +2/day :127
## 15-24:92 1 to 2/week: 44
## 25-34:69 1/day : 95
## 35-44:40 3 to 6/week: 34
## 45-59:61
##
##
Let’s look at a summary of the MCA of the dataset based on these specs above.
library(FactoMineR) # provides MCA()
mca <- MCA(tea_filtered, quali.sup = 4:7, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_filtered, quali.sup = 4:7, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.423 0.399 0.383 0.355 0.335 0.330 0.316
## % of var. 12.688 11.967 11.497 10.653 10.043 9.907 9.484
## Cumulative % of var. 12.688 24.655 36.152 46.805 56.848 66.755 76.239
## Dim.8 Dim.9 Dim.10
## Variance 0.283 0.271 0.238
## % of var. 8.490 8.142 7.129
## Cumulative % of var. 84.729 92.871 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.720 0.409 0.056 | 1.341 1.503 0.196 | 0.789 0.541
## 2 | 0.782 0.481 0.216 | 0.593 0.294 0.124 | 0.071 0.004
## 3 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 4 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 5 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 6 | -0.357 0.100 0.027 | -0.078 0.005 0.001 | 0.344 0.103
## 7 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 8 | 0.782 0.481 0.216 | 0.593 0.294 0.124 | 0.071 0.004
## 9 | 0.497 0.195 0.083 | 0.222 0.041 0.016 | 0.544 0.257
## 10 | 1.104 0.961 0.443 | -0.678 0.384 0.167 | 0.138 0.016
## cos2
## 1 0.068 |
## 2 0.002 |
## 3 0.542 |
## 4 0.542 |
## 5 0.542 |
## 6 0.025 |
## 7 0.542 |
## 8 0.002 |
## 9 0.099 |
## 10 0.007 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.366 36.278 0.611 13.516 | -0.083 0.143 0.002
## Earl Grey | -0.440 9.794 0.348 -10.207 | 0.312 5.225 0.175
## green | -0.493 2.106 0.030 -2.996 | -1.636 24.612 0.331
## p_branded | -0.497 6.176 0.115 -5.856 | -0.218 1.257 0.022
## p_cheap | 1.116 2.289 0.030 2.982 | 1.313 3.361 0.041
## p_private label | 0.018 0.002 0.000 0.084 | -0.132 0.102 0.001
## p_unknown | 0.314 0.310 0.004 1.108 | 2.953 29.139 0.363
## p_upscale | 1.063 15.724 0.242 8.512 | -0.874 11.272 0.164
## p_variable | -0.188 1.036 0.021 -2.504 | 0.225 1.575 0.030
## alone | -0.275 3.870 0.140 -6.477 | -0.328 5.831 0.199
## v.test Dim.3 ctr cos2 v.test
## black -0.825 | 0.099 0.209 0.003 0.977 |
## Earl Grey 7.240 | -0.247 3.420 0.110 -5.741 |
## green -9.947 | 1.224 14.341 0.185 7.443 |
## p_branded -2.566 | 0.459 5.803 0.098 5.403 |
## p_cheap 3.509 | 0.315 0.201 0.002 0.841 |
## p_private label -0.626 | 1.039 6.569 0.081 4.928 |
## p_unknown 10.421 | 1.519 8.030 0.096 5.362 |
## p_upscale -6.999 | 0.310 1.476 0.021 2.483 |
## p_variable 2.999 | -0.913 27.080 0.497 -12.188 |
## alone -7.721 | -0.153 1.330 0.044 -3.614 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.611 0.359 0.207 |
## price | 0.324 0.559 0.565 |
## How | 0.334 0.279 0.378 |
##
## Supplementary categories (the 10 first)
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2
## F | -0.099 0.014 -2.071 | -0.001 0.000 -0.018 | -0.092 0.012
## M | 0.145 0.014 2.071 | 0.001 0.000 0.018 | 0.135 0.012
## employee | -0.153 0.006 -1.307 | -0.104 0.003 -0.891 | 0.027 0.000
## middle | 0.179 0.005 1.213 | 0.095 0.001 0.645 | 0.033 0.000
## non-worker | 0.281 0.021 2.531 | -0.214 0.012 -1.927 | -0.010 0.000
## other worker | 0.221 0.003 1.021 | 0.174 0.002 0.804 | 0.108 0.001
## senior | 0.226 0.007 1.419 | 0.086 0.001 0.543 | 0.009 0.000
## student | -0.350 0.037 -3.343 | 0.108 0.004 1.027 | -0.090 0.002
## workman | -0.327 0.004 -1.154 | 0.167 0.001 0.588 | 0.133 0.001
## +60 | 0.702 0.072 4.624 | -0.316 0.014 -2.078 | -0.033 0.000
## v.test
## F -1.930 |
## M 1.930 |
## employee 0.229 |
## middle 0.225 |
## non-worker -0.091 |
## other worker 0.498 |
## senior 0.056 |
## student -0.861 |
## workman 0.470 |
## +60 -0.218 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.014 0.000 0.012 |
## SPC | 0.068 0.020 0.004 |
## age_Q | 0.130 0.022 0.014 |
## frequency | 0.011 0.007 0.026 |
Let’s start with a plot of the individuals (ind):
plot(mca, graph.type = "ggplot", invisible = c("var", "quali.sup", "quanti.sup"), cex = 0.8)
I see no real clusters, but there seem to be some duplicates. Dimension 1 covers 12.7% of the variance and dimension 2 covers 12.0%, roughly the same. The scatter plot forms a triangle in this space, with a wide base around Dim 1 = 0 extending to the right.
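The apparent duplicates can be counted directly: respondents who gave identical answers on the three active variables land on identical coordinates. A quick sketch:

```r
# Number of rows whose answers on the active variables repeat an earlier row.
sum(duplicated(tea_filtered[, c("Tea", "price", "How")]))
```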
Now let’s plot the var variables — Tea,
price, and How — which the plot function for
FactoMineR’s MCA colors black, red, and green, respectively.
plot(mca, graph.type="ggplot", invisible=c("ind","quali.sup","quanti.sup"), cex=0.8, habillage="quali")
Dimension 1 seems to cover:

- most strongly (Tea: 0.611), whether the respondent prefers black tea or not (black +1.37, green -0.49, Earl Grey -0.44)
- then (How: 0.334), whether the respondent likes to add things to their tea (alone -0.28; lemon, milk, and other all positive)
- and finally, no clear correlation with price, since both upscale and cheap are positive in dimension 1
Dimension 2 seems to cover:

- whether the respondent is unlikely to prefer green tea (green -1.64)
- whether the respondent likes cheap tea (cheap +1.31, upscale -0.87)
The p_unknown category of price has the highest coefficient here (+2.95), which makes this dimension tough to interpret.
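Rather than scanning the full summary for these coefficients, the category coordinates can be pulled straight from the MCA object, which exposes them in its `$var$coord` slot:

```r
# Coordinates of the active-variable categories on the first two dimensions.
round(mca$var$coord[, 1:2], 2)
```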
Finally, let’s look at the demographics of the individuals and see how they distribute over this space:
plot(mca, graph.type="ggplot", invisible=c("ind","var","quanti.sup"), cex=0.8, habillage="quali")
It seems that age is the variable that best explains the differences in respondents' tea drinking habits (dim 1: 0.130, dim 2: 0.022), followed by SPC, which encodes some of the same information as age, so perhaps it is redundant. Using the interpretation of the dimensions from before, we could roughly say that:

- older people are more likely to prefer generic black tea to green tea or Earl Grey
- younger people are more likely to prefer Earl Grey
- green tea is most likely to be preferred by people in the 25-34 age range who are "employees" (office workers?)
- older people prefer to add milk, lemon, or something else to their tea
There is also a slight difference in the distribution of answers between sexes:

- men are more likely to prefer black tea
- women are more likely to prefer Earl Grey or green tea
The frequency of tea drinking is all over the place and forms no clear trend in the biplot.
# Social equality trend line
It is worth pointing out that there is a clear trend line in the upper left corner of the biplot, running from the bottom right to the top left. Most of the countries lie along this trend line; the rest are spread widely, which is why the trend is not aligned with the principal components. Let's call it the social equality trend line, as all the countries at the top of it have social welfare programs, such as universal healthcare, strong unions, and low Gini indices. (A bit hand-wavy, but this is a course assignment, not a research paper.)